Role of Data Management (and MDM) in Analytics
In the last blog we dissected the important role that analytics plays in Master Data Management (MDM). But that raises another question: flipping the script, does MDM, or data management more broadly, play a role in analytics?
To answer that question, let's look at one popular methodology that data scientists employ, called CRISP-DM (Cross-Industry Standard Process for Data Mining).
The CRISP-DM Model
A diagrammatic representation of the CRISP-DM model, which is immensely popular with analytics experts, shows the six steps the model recommends for mining and analyzing data. Step 2 (Data Understanding, or Data Study) and Step 3 (Data Preparation) involve data management activities. A key goal of data management is to integrate and co-locate relevant data in a warehouse repository and improve its quality (see the earlier blog for more details). Steps 2 and 3 of the CRISP-DM model address exactly these goals. Step 3 also involves another key activity: data modeling (not to be confused with Step 4, Modeling, in the diagram, which refers to the creation of statistical or analytical models). To properly achieve the goal of improved data quality, the warehoused data needs to be arranged in data models that identify the entities, attributes, relationships and hierarchies of the data, especially as they relate to master data.
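As a rough illustration of what such a data model captures, here is a minimal Python sketch. The `Customer`, `Product` and `Order` entities, their fields and the sample records are all hypothetical, invented only to show entities, attributes, relationships and a hierarchy in one place. It is not how any particular MDM product represents its models.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


# Hypothetical master-data entities, invented purely for illustration.
@dataclass
class Customer:
    customer_id: str                  # attribute: unique key of the entity
    name: str                         # attribute
    parent_id: Optional[str] = None   # hierarchy: subsidiary -> parent company


@dataclass
class Product:
    product_id: str
    name: str
    category: str


# Hypothetical transaction entity that refers to the master data above.
@dataclass
class Order:
    order_id: str
    customer_id: str   # relationship: order -> customer
    product_id: str    # relationship: order -> product
    quantity: int


def hierarchy_path(customers: Dict[str, Customer], customer_id: str) -> List[str]:
    """Follow parent links upward and return the ids from the customer to the root."""
    path: List[str] = []
    current: Optional[str] = customer_id
    # Stop at the root, at an unknown id, or if the links form a cycle.
    while current in customers and current not in path:
        path.append(current)
        current = customers[current].parent_id
    return path


def dangling_orders(orders: List[Order],
                    customers: Dict[str, Customer],
                    products: Dict[str, Product]) -> List[str]:
    """Return ids of orders whose customer or product is missing from master data."""
    return [
        o.order_id
        for o in orders
        if o.customer_id not in customers or o.product_id not in products
    ]


# Made-up sample records.
customers = {
    "C1": Customer("C1", "Acme Holdings"),
    "C2": Customer("C2", "Acme Europe", parent_id="C1"),
    "C3": Customer("C3", "Acme Germany", parent_id="C2"),
}
products = {"P1": Product("P1", "Widget", "Hardware")}
orders = [Order("O1", "C3", "P1", 5), Order("O2", "C9", "P1", 1)]

print(hierarchy_path(customers, "C3"))
print(dangling_orders(orders, customers, products))
```

Once the relationships are explicit, a quality problem such as an order pointing at an unknown customer becomes a simple check rather than a surprise discovered during analysis.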
So now let’s ask again. Do data management and MDM have a part to play in analytics projects? If two of the six steps in the most popular data science model involve data management activities, as we just discussed, then shouldn’t the answer be an emphatic YES?
Now that we have established that data management is deeply integrated into analytics, let's ask the real million-dollar question: why is the role of data management in analytics so widely underestimated and even ignored?
A question of quality of analytics
Poor data quality leads to poor-quality analytics. That’s obvious, and every data scientist knows it. But without a traditional data management solution, and without specialized treatment of data based on its type (master, transaction, reference and so on), data scientists are handicapped in trying to improve data quality. They therefore fall back on rudimentary techniques, using tools like Excel and scripts in languages like Python to integrate and cleanse data. This approach is unproductive, often ineffective and, as data sources and volumes grow, almost always unscalable. What a data scientist needs in order to produce quality analytics solutions is a sound data management strategy and mature, enterprise-grade data management tools.
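To make the "rudimentary techniques" point concrete, here is a minimal sketch of the kind of ad hoc Python script a data scientist might write to integrate and cleanse customer records from two systems. The source names, field names and records are hypothetical, and matching on e-mail address is just one naive rule picked for illustration.

```python
import re

# Hypothetical records from two source systems; all values are made up.
crm_rows = [
    {"name": "ACME Corp.", "email": "Sales@Acme.com"},
    {"name": "Globex  Inc", "email": "info@globex.com"},
]
billing_rows = [
    {"name": "Acme Corp", "email": "sales@acme.com "},
    {"name": "Initech LLC", "email": "ap@initech.com"},
]


def normalize(row):
    """Return a cleaned copy of one record."""
    name = re.sub(r"[^\w\s]", "", row["name"])   # drop punctuation
    name = re.sub(r"\s+", " ", name).strip()     # collapse repeated spaces
    email = row["email"].strip().lower()         # trim and lower-case
    return {"name": name, "email": email}


def integrate(*sources):
    """Merge sources, keeping the first record seen per e-mail (a naive match rule)."""
    merged = {}
    for source in sources:
        for row in source:
            clean = normalize(row)
            merged.setdefault(clean["email"], clean)
    return list(merged.values())


customers = integrate(crm_rows, billing_rows)
for customer in customers:
    print(customer)
```

A script like this works for two small lists, but the match rule breaks as soon as the same customer uses a different e-mail address in each system, and every new source tends to need its own special cases. That is where the approach stops scaling.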
Speaking of data quality, one important point needs to be called out before we move on. A foundational step in data management is data modeling. This is different from the statistical and analytical modeling that data scientists use for analytics. Without going into depth, suffice it to say that data modeling improves data quality, which in turn greatly improves the analytical models that data scientists go on to construct.
So, what’s the problem?
With introducing data management into analytics, that is. From my experience participating in data management projects and working with data scientists, it comes down to these things, in my opinion:
- Awareness (or lack thereof) – just as data management experts failed to realize the potential of analytics, as I explained in my last blog, analytics experts have generally been unaware that there are data management solutions in the market that specialize in data quality and governance.
- Collaboration (or lack thereof) – even in companies that have data management and analytics initiatives, with entire organizations dedicated to them, there is little to no collaboration and cooperation between the two. This ties back to the lack of awareness but, more importantly, to a failure on management's part to recognize the synergies between the two.
- Scale (or lack thereof) – many analytics projects are limited in scope and involve smaller, siloed datasets that simply don’t exhibit an unwieldy data quality problem and hence do not warrant an enterprise-grade data management system. But even in this situation, as the company grows and ages, data volumes expand and the need to integrate data from different sources and domains increases, which justifies an investment in a commercial data management solution.
Finally – the real definition of data science
The general perception of data science is that it is all about analytics and often involves ML and AI. However, as the CRISP-DM model shows, this is a misconception. Data science also involves data management (in one form or another), which is a precursor to the ensuing analytics. Even if a data science initiative does not employ an enterprise-grade data management solution, key data management activities to improve data quality still need to be performed. Therefore, the real definition of data science is:
Data Science = Data management + Analytics!