Data is here to stay. In fact, your current data has already outlived or will most likely outlive its current systems and processes. What you need to be mindful of is that every time data goes through a data integration process, there are chances of errors. These are the 5 data integration processes prone to creating bad data quality:
1. Data warehousing
As there might be ambiguity around this term, I’m referring to the technique of processing, transformation and ingestion of data from one or more sources into a data warehouse, data mart or even an operational data store (ODS). Every time there is a data transformation step, errors can occur due to data type changes (ex: double to integer), incorrect semantics, and so on. If the data warehouse loads data from multiple sources, the chance of creating bad data quality grows.
2. Data migration
When you are switching to a new system or you are importing new data sets into your CRM, CMS, ERP, etc., you go through data migration. As opposed to data warehousing, this is done as a one time project. Since these type of projects have set deadlines, the data migration task does not include data profiling, data quality assurance, data modeling, creating data definitions and so on. Why? It would add too much project cost and effort. The consequence to this decision is that you will migrate data in its current state, but also have to transform it to adapt to the new data architecture. I can tell you from my experience of coming into these projects after data migrations, that it can take years to cleanse this data.
Improve your reference data management. Adopt these 5 best practices for managing reference data.
3. Data consolidation
There is a tendency to centralize systems as much as possible, for obvious reasons, though centralization requirements also occur due to acquisitions of another company. When this happens, data will need to be consolidated and often merged into a single system. Most bad data quality will occur due to different business definitions and rules between the two organizations or units governing the two systems. The lack of quality control and assurance over one of the two databases will also have a big impact.
4. Data synchronization
Synchronizing data between 2 different databases is one of the most challenging aspects of maintaining data quality. Errors often occur either because of:
- The time variance: not knowing which one of the two records is most up to date. Most data models don’t have a “last updated date” at the column level so synchronizing particular data elements can be tricky and require solid business rules.
- The data architecture: converting a data type into another can create a data loss, though most often I see issues when data in database A is recorded at a more detailed level than the data in database B. For example an address in database A has its address elements recorded in individual columns (ex: address line 1 and 2, city, state, etc.) whereas in database B is simply recorded into a single one: address.
5. Master reference data services
Similar to the data synchronization, master reference data services are bi-directional between the master reference data system and the business systems. Much alike the examples above, these data integration processes can introduce data defects.
All of the data integration processes listed above can bring plenty of benefits to better support your business needs. We just have to be aware of the high risks of introducing bad data quality in these processes. In particular, we should pay attention to defining data and business semantics, business requirements and rules, the time variance, as well as changes to systems, data, and business needs.