5 data integration processes prone to creating bad data quality

Data is here to stay. In fact, your data has probably already outlived earlier systems and processes, and it will most likely outlive its current ones. What you need to be mindful of is that every time data goes through a data integration process, there is a chance of introducing errors. These are the 5 data integration processes prone to creating bad data quality:

1. Data warehousing

As there might be ambiguity around this term, I’m referring to the technique of processing, transforming and ingesting data from one or more sources into a data warehouse, data mart or even an operational data store (ODS). Every time there is a data transformation step, errors can occur due to data type changes (e.g., double to integer), incorrect semantics, and so on. If the data warehouse loads data from multiple sources, the chance of creating bad data quality grows.
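As an illustration of the data type risk mentioned above, here is a minimal Python sketch that flags values which would change when cast from a floating-point number to an integer. The function name `find_lossy_int_casts` and the sample values are hypothetical; a real transformation step would run an equivalent check in whatever ETL tool is in use.

```python
from typing import Iterable


def find_lossy_int_casts(values: Iterable[float]) -> list[tuple[float, int]]:
    """Return (original, converted) pairs where casting to int changes the value."""
    lossy = []
    for value in values:
        converted = int(value)  # int() truncates toward zero, dropping the fraction
        if converted != value:
            lossy.append((value, converted))
    return lossy


# Hypothetical source column stored as doubles
source_amounts = [10.0, 19.99, 3.5, 42.0]
print(find_lossy_int_casts(source_amounts))  # [(19.99, 19), (3.5, 3)]
```

Running a check like this before the load makes the loss visible instead of letting it pass silently into the warehouse.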

2. Data migration

When you are switching to a new system or importing new data sets into your CRM, CMS, ERP, etc., you go through data migration. As opposed to data warehousing, this is done as a one-time project. Since these types of projects have set deadlines, the data migration task often does not include data profiling, data quality assurance, data modeling, creating data definitions and so on. Why? It would add too much project cost and effort. The consequence of this decision is that you will migrate data in its current state, but also have to transform it to adapt to the new data architecture. I can tell you from my experience of coming into these projects after data migrations that it can take years to cleanse this data.
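Data profiling does not have to be a large effort. The sketch below shows the kind of summary a basic profile produces for one column before a migration: row count, missing values, distinct values, and the data types observed. The function `profile_column`, the column name, and the sample rows are all hypothetical, and the sketch treats `None` and empty strings as missing, which is an assumption that may not fit every source system.

```python
from collections import Counter


def profile_column(rows: list[dict], column: str) -> dict:
    """Summarize one column: rows, missing values, distinct values, and types seen."""
    values = [row.get(column) for row in rows]
    # Assumption: None and empty strings both count as missing
    present = [v for v in values if v is not None and v != ""]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "types": dict(Counter(type(v).__name__ for v in present)),
    }


# Hypothetical rows extracted from the legacy system
legacy_rows = [
    {"id": 1, "country": "CA"},
    {"id": 2, "country": "Canada"},
    {"id": 3, "country": ""},
    {"id": 4, "country": None},
    {"id": 5, "country": "CA"},
]
print(profile_column(legacy_rows, "country"))
```

Here the profile would show two missing values and two distinct spellings for what is presumably the same country, both of which are cheaper to resolve before the migration than after it.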

3. Data consolidation

There is a tendency to centralize systems as much as possible, for obvious reasons, though centralization requirements also arise from the acquisition of another company. When this happens, data will need to be consolidated and often merged into a single system. Most bad data quality will occur due to different business definitions and rules between the two organizations or units governing the two systems. The lack of quality control and assurance over one of the two databases will also have a big impact.
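One way to keep differing business definitions from silently corrupting the merged system is to translate values through an explicit, agreed mapping and set aside anything the mapping does not cover. The following is a minimal sketch under assumed names: `harmonize_status`, the `status` field, and the code values are hypothetical examples, not taken from any particular system.

```python
def harmonize_status(records: list[dict], mapping: dict[str, str]) -> tuple[list[dict], list[dict]]:
    """Translate each record's status to the shared definition; collect unmapped records."""
    harmonized, unmapped = [], []
    for record in records:
        raw = record["status"]
        if raw in mapping:
            # Copy the record so the source data is left untouched
            harmonized.append({**record, "status": mapping[raw]})
        else:
            # No agreed definition: hold the record for review instead of guessing
            unmapped.append(record)
    return harmonized, unmapped


# Hypothetical mapping agreed between the two business units
acquired_to_shared = {"A": "active", "I": "inactive"}
acquired_records = [{"id": 1, "status": "A"}, {"id": 2, "status": "P"}]

clean, needs_review = harmonize_status(acquired_records, acquired_to_shared)
print(clean)         # [{'id': 1, 'status': 'active'}]
print(needs_review)  # [{'id': 2, 'status': 'P'}]
```

The design choice worth noting is that unmapped values are routed to a review list rather than defaulted, so gaps in the shared definitions surface as work items instead of as bad data.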

4. Data synchronization

Synchronizing data between two different databases is one of the most challenging aspects of maintaining data quality. Errors often occur because of either:

The time variance: not knowing which of the two records is the most up to date. Most data models don’t have a “last updated date” at the column level, so synchronizing particular data elements can be tricky and requires solid business rules.
The data architecture: converting one data type into another can cause data loss, though most often I see issues when data in database A is recorded at a more detailed level than the data in database B. For example, an address in database A has its address elements recorded in individual columns (e.g., address line 1 and 2, city, state, etc.), whereas in database B it is simply recorded in a single column: address.
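Both causes above can be sketched in a few lines of Python. The first function assumes each field carries its own last-updated timestamp, which, as noted, most data models do not have, and keeps the newer value per field. The second flattens detailed address columns into a single string. The names `merge_by_field_timestamp` and `flatten_address`, the field names, and the sample data are hypothetical.

```python
from datetime import datetime


def merge_by_field_timestamp(record_a: dict, record_b: dict) -> dict:
    """Each record maps field -> (value, last_updated). Keep the newer value per field."""
    merged = {}
    for field in record_a.keys() | record_b.keys():
        candidates = [r[field] for r in (record_a, record_b) if field in r]
        # On equal timestamps max() keeps the first candidate, i.e. record_a wins
        merged[field] = max(candidates, key=lambda pair: pair[1])
    return merged


def flatten_address(parts: dict) -> str:
    """Join detailed address columns (database A) into one string (database B)."""
    ordered = ("address_line_1", "address_line_2", "city", "state")
    return ", ".join(parts[key] for key in ordered if parts.get(key))


in_a = {"email": ("old@example.com", datetime(2023, 1, 5)),
        "phone": ("555-0100", datetime(2023, 6, 1))}
in_b = {"email": ("new@example.com", datetime(2023, 3, 9)),
        "phone": ("555-0199", datetime(2022, 11, 20))}
print(merge_by_field_timestamp(in_a, in_b))

address_a = {"address_line_1": "123 Main St", "address_line_2": "",
             "city": "Springfield", "state": "IL"}
print(flatten_address(address_a))  # 123 Main St, Springfield, IL
```

Going from A to B is straightforward, but the reverse is not: splitting a free-text address back into individual columns requires parsing rules, and that is where synchronization defects tend to appear. Likewise, the tie-breaking rule in the merge is itself a business rule that needs to be agreed on rather than left to the code.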

5. Master reference data services

Similar to data synchronization, master reference data services exchange data in both directions between the master reference data system and the business systems. Much like the examples above, these data integration processes can introduce data defects.

Conclusion

All of the data integration processes listed above can bring plenty of benefits to better support your business needs. We just have to be aware of the high risks of introducing bad data quality in these processes. In particular, we should pay attention to defining data and business semantics, business requirements and rules, the time variance, as well as changes to systems, data, and business needs.

About the author

George Firican

George Firican is the Director of Data Governance and Business Intelligence at the University of British Columbia, which is ranked among the top 20 public universities in the world. His passion for data led him towards award-winning program implementations in the data governance, data quality, and business intelligence fields. Due to his desire for continuous improvement and knowledge sharing, he founded LightsOnData, a website which offers free templates, definitions, best practices, articles and other useful resources to help with data governance and data management questions and challenges. He also has over twelve years of project management and business/technical analysis experience in the higher education, fundraising, software and web development, and e-commerce industries.
