Every industry and workload has its mythology. It starts with us – people believing in certain theses or taking someone else’s opinion as the truth.
Over the years I’ve worked on many different projects related to data and analytics: web apps with database back ends, complex ERP systems, Big Data, data warehousing and Business Intelligence solutions. Myths have always been there, spread by my coworkers or customers. I’ve decided to pick some of them and share why I think they are myths. So let’s dive into mythology for a moment…
Myth #1: Data analytics is always complex and expensive
For decades we’ve been taught that platforms for data management and analytics are complex, consist of many layers and require multiple skills to build solutions on top of them. All those ETL/ELT tools and processes, data lakes and warehouses, advanced analytics and big data, sophisticated BI tools and more… Sounds like a huge invoice for the platform (infrastructure and software) plus another solid bill for the services of skilled professionals. Well, that could be the case. If the organization really requires all those things (an end-to-end platform for analytics) and doesn’t have the skills needed to build the platform and solutions, that costs money (and I’m a big advocate of engaging professionals to do their job: build the solution and help the organization develop internal skills). One day your high-level architecture may look like the one below (for the full reference see Azure data platform end-to-end):
But there is also good news – you can start small and build your world of data analytics piece by piece, depending on your organization’s size and its maturity in data literacy and analytics (see below how Gartner grades analytics maturity).
Typically, the very first types of analytics to cover are the two on the left side of the above chart: descriptive and diagnostic analytics (what happened and why, based on historical data). That means the first analytics project is typically about analysis and reporting on data from the organization’s core transactional systems.
More good news: you can use different cloud services to implement the platform quickly, without investing in proprietary hardware or software, aligned to your strategic milestones and with many options to grow the solution when your analytics reaches the next level. Again, you can start small with a basic set of tools and services that address the most urgent business requirements, and get buy-in from business stakeholders by proving that analytics can bring business value and support business decisions.
Here is one of the easiest implementations of analytics in a mid-sized organization (a retail company running several hundred stores) having one core system (ERP) with additional data coming in Excel files:
Note there are no back-end Azure services in this architecture. The only component of the analytics platform is Power BI, which is responsible for data ingestion and transformation, automatic data refresh, data modeling (a mash-up of multiple data sources) and reporting (including self-service and interactive analytics). In the real-life scenario, the customer decided to use a read-only replica of the ERP system database to avoid additional load on the transactional system. The next step would be to add services for building a data layer once the data grows and the limits of Power BI datasets are reached. And guess what, it still doesn’t have to cost millions to run Azure SQL or Azure Synapse Analytics as your data layer. We’ll try to get back to this topic with Bartek Graczyk at SQLDay 2021 🙂
Myth #2: Big data and data warehousing are for Enterprise only
No, no, no, and once again no! Your organization does not have to struggle with all the “Vs” of big data (some even list 10 Vs, of which “volume”, “variety” and “velocity” are the most widely recognized) to benefit from an integrated platform for analytics in which a data warehouse and/or data lake can be important parts. Take a look at some thoughts of James Serra (congrats on the next step in your professional career!), one of my favorite bloggers on big data and data warehousing, to find the major reasons for having a data lake and a data warehouse in your data platform. And yes, it’s not a mistake – I’m referencing blog posts published back in 2017 and 2013 (!) respectively 🙂
I’ve seen companies running data warehouses that are completely purged and refreshed every night, yet still deliver value: data integration, uniform dimensions and KPIs for analytics, and a single trusted source for reporting and analytics. I’ve seen small companies with only a few data sources running a data lake for the sake of speed and agility, as well as for analytics over semi-structured and unstructured data.
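A nightly purge-and-reload warehouse like the ones mentioned above can be surprisingly little code. Below is a minimal sketch in Python that uses the standard sqlite3 module as a stand-in for whatever database actually hosts the warehouse. The table names (dim_store, fact_sales), the column names and the full_refresh function are illustrative assumptions, not taken from any real project.

```python
import sqlite3


def full_refresh(conn, source_rows):
    """Purge the warehouse tables and reload them from the source extract."""
    # Hypothetical star schema: one dimension table and one fact table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_store "
        "(store_id INTEGER PRIMARY KEY, store_name TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(store_id INTEGER, sales_date TEXT, amount REAL)"
    )
    # "with conn" commits on success and rolls back on error, so a failed
    # load leaves the previous night's data in place.
    with conn:
        conn.execute("DELETE FROM fact_sales")
        conn.execute("DELETE FROM dim_store")
        # Uniform dimension: one row per store, deduplicated from the extract.
        stores = sorted({(r["store_id"], r["store_name"]) for r in source_rows})
        conn.executemany("INSERT INTO dim_store VALUES (?, ?)", stores)
        conn.executemany(
            "INSERT INTO fact_sales VALUES (?, ?, ?)",
            [(r["store_id"], r["sales_date"], r["amount"]) for r in source_rows],
        )
```

Because every run starts from empty tables, calling full_refresh twice with the same extract leaves exactly the same rows behind, which is a large part of why this pattern is easy to operate at small data volumes.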
And one last important note on this one: building and maintaining a data warehouse and/or data lake does not require spending millions of dollars on infrastructure and software. Thanks to a broad range of PaaS cloud services like Azure Data Lake Storage, Azure SQL Database (including the serverless option), Azure Databricks and Azure Synapse Analytics, big data analytics does not have to be super expensive and is becoming a common scenario. And yeah, that’s right: running Synapse does not always mean a high bill, as the cost really depends on the use case; also see the latest announcement of time-limited free monthly quantities for the serverless and Spark engines.
Myth #3: One data model to rule them all and to get rid of data silos
While I believe a single version of the truth is the holy grail of analytics, and a distant, often unattainable level of the so-called “data-driven company” that organizations should nevertheless strive for, I don’t think it’s just a matter of technology or data modeling. Yes, I think running a central data platform for integration, governance and serving common data to the whole organization is a good idea. But…
I’ve seen large companies achieving very high maturity of analytics and getting true value from data while running several data warehouses (subject/business area oriented) and/or data models created in BI tools (for many different reasons, from the history of the organization to the limitations of the platforms used for analytics). I would say getting rid of data silos is more about how the organization manages the definitions of data entities, dimensions and KPIs, and how business users use those data-related artifacts, than about what the architecture of the data platform itself looks like.
To wrap up this myth – while having one central data platform with a single data lake and/or data warehouse and a common set of data flows is typically the best approach, I don’t think it makes sense (and it’s not always possible due to technical limitations!) to enforce one data model in each tier of the analytics platform. Common shared definitions of major data entities plus unified processes for working with data can bring more benefits. Governance, stupid!
Myth #4: Modern self-service BI tools are here to replace Excel
Oh my, this is one of THOSE religious wars – Excel vs BI tools 🙂 Some people claim that modern BI tools are going to replace Excel, especially when it comes to analytics. While in general I think there are many areas in which platforms like Power BI can successfully automate tasks we used to perform in spreadsheets, I don’t think anybody should expect BI tools to replace tools like Excel.
The reason is quite simple: there are some areas in which editable spreadsheets, which allow quick modification of data and work well with wide flat tables and/or pivot tables, will outperform practically any BI tool available on the market. A good example is finance and controlling, especially planning and budgeting, where data visualization is usually not the top priority (but BI tools can still be used to automate data integration, analyze actuals vs budget, visualize trends, enable collaboration of business users and trigger actions when your metrics reach their thresholds).
In my experience, picking the best tool for each task and making full use of the integration between BI tools and Excel typically brings the most benefits. Just make sure you automate whatever can be done without manual work (which is often the source of most errors and inconsistencies in data).
Myth #5: Data governance is about tools and platforms
Some people believe that data governance starts with the right technology: a set of specialized platforms or tools for managing different aspects of data, such as quality, glossary and business rules.
But you know what? I’ve seen large enterprises successfully managing master data and data quality with very simple custom solutions built with common tools like SQL databases and intranet portals (SharePoint). It wasn’t a perfect solution, but it did the job, because there were dedicated people involved in data governance, responsible for common data entities and KPIs, and the surrounding processes were simple and – most of all – well known and followed by everyone involved in data and information management.
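To illustrate how far common tools can go, here is a minimal sketch of rule-based data quality checks written as plain SQL and run from Python with the standard sqlite3 module. The customer table, its columns and the two rules are hypothetical examples; in practice the rules would be defined by the people who own the data entities.

```python
import sqlite3

# Hypothetical rules: each maps a rule name to a query that counts
# violations (offending rows, or duplicated key values).
QUALITY_RULES = {
    "customer_missing_email": (
        "SELECT COUNT(*) FROM customer WHERE email IS NULL OR email = ''"
    ),
    "customer_duplicate_tax_id": (
        "SELECT COUNT(*) FROM ("
        "SELECT tax_id FROM customer GROUP BY tax_id HAVING COUNT(*) > 1)"
    ),
}


def run_quality_checks(conn, rules=QUALITY_RULES):
    """Return {rule_name: violation_count} for every rule."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in rules.items()}
```

The resulting counts could, for example, be published on an intranet page so that data owners review them regularly. The tooling stays trivial; what makes it work is that somebody owns each rule and acts on the numbers.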
That leads me to the point: data governance starts with organizational culture, with people and processes, with data ownership and responsibility, and not with sophisticated platforms. And – now, this one is huge – it’s a long-term process and a strategic organizational effort rather than a one-time deployment.
One of the most comprehensive guides for data governance is the Data Governance Framework provided by the Data Governance Institute. For comments and best practices for data governance (based on the DGI framework), see the article by Profisee, a well-known vendor of master data management solutions: Data Governance – What, Why, How, Who & 15 Best Practices (profisee.com). In my opinion the article and the infographic shown below perfectly describe how wide the topic of data governance is and how many different aspects of the organization and data management it touches. Notice one thing – there is no technology mentioned in this picture!
Myths related to the world of data and analytics have been and will remain an inseparable part of our reality. However, it’s up to us whether we replicate them in our organizations or try to fight them through constant learning (“fail fast, learn, repeat”) and sharing best practices.
I realize that with this post I’m only scratching the surface. I’m sure you could share tons of other common myths and I encourage you to do so (looking forward to reading your comments!). Only through the constant sharing of knowledge can we educate ourselves and our companies not to repeat mistakes that have already been made many, many times.