Data quality myth: our data is good

There are quite a few data quality myths that need to be dispelled before we can move forward and mitigate data quality risks. Last year I covered 4 myths about Data Quality everyone thinks are true, which started an entire trend on LinkedIn and also sparked a series of YouTube videos. So, here is the next data quality myth that we need to understand and debunk:

Myth #5: Our data is good

Do you think your organization has good data quality? I have heard a positive answer to this question more often than I expected. That would be a good thing, sure, but here’s why a lot of organizations think their data quality is good when that’s not the reality. It’s a myth!

Many organizations simply can’t accept the fact that they may have an issue with the quality of their data. Let me outline some of the reasons why that is:

1. Lack of information

Let’s get the obvious out of the way, which is the lack of information. Some business executives suffer from this: they are shielded and unaware that their company’s data quality is poor. Sometimes they are misinformed about it, or they choose not to care. Either way, it’s not a good scenario to be part of.

2. Data migration

For the second reason, let’s consider a data set that contains bio data for customers: names, dates of birth, etc. It could be that this data set is of good quality in application A, in database A, in source system A. So when you’re talking to a data steward or a data custodian who is focused on application/database/source system A, they will tell you: “My data is good”. That could very well be the case, but once that data set is transferred to another database or consumed by another application within the same organization, its quality will most likely drop, especially if you don’t have a data governance program and a data quality program in place. It’s unavoidable. Without these programs, even though the source application does a good job and understands the business rules and exceptions this data set needs to abide by, the same can’t be said about the target application. When data is migrated to a new system, there’s a high chance it will be transformed into something that conflicts with those business rules, and no one is the wiser.
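One way to notice this kind of drop is to run the same business rules against the source and the target and flag records that pass in one but fail in the other. The sketch below is only an illustration: the field names, the two rules, and the helpers `rule_violations` and `degraded_records` are all assumptions made for the example, not part of any specific tool.

```python
from datetime import date

def rule_violations(record, today):
    """Check one record against two illustrative (assumed) business rules."""
    problems = []
    if not str(record.get("name", "")).strip():
        problems.append("name is empty")
    dob = record.get("date_of_birth")
    if not isinstance(dob, date):
        problems.append("date_of_birth is not a date")
    elif dob > today:
        problems.append("date_of_birth is in the future")
    return problems

def degraded_records(source, target, today):
    """Return {id: problems} for records that are clean in source but not in target."""
    degraded = {}
    for record_id, src in source.items():
        tgt = target.get(record_id)
        if tgt is None:
            degraded[record_id] = ["missing in target"]
            continue
        # Only report records whose quality dropped during the transfer.
        if not rule_violations(src, today):
            problems = rule_violations(tgt, today)
            if problems:
                degraded[record_id] = problems
    return degraded

# Hypothetical data: the target system stored one date of birth as free text.
source = {
    1: {"name": "Jane Doe", "date_of_birth": date(1985, 2, 11)},
    2: {"name": "John Smith", "date_of_birth": date(1979, 5, 3)},
}
target = {
    1: {"name": "Jane Doe", "date_of_birth": "11/02/1985"},
    2: {"name": "John Smith", "date_of_birth": date(1979, 5, 3)},
}
result = degraded_records(source, target, date(2024, 1, 1))
print(result)
```

In practice the rules would come from the data governance program, so that the source and target teams are checking against the same definitions.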

3. Metadata

For the third reason, I remember a conversation with a data professional who was bragging about the good quality of their data even though they didn’t have a data governance program. I was happy to hear it, but also a bit skeptical, precisely because they didn’t have that program. I then had a chance to poke around in one of their databases and do a bit of data profiling. In a way he was right, the quality of the data was good, but boy was he wrong. Let me give you an example of what I mean. They were storing delivery information in address fields and even in name fields. The delivery information, such as instructions on where to leave packages, what the buzzer number is and so on, was accurate, but it was stored in the wrong fields. Not to mention its consistency. So, in order to have good quality data you should look at all data quality dimensions and consider the metadata as well. I mean, how many times have you found interesting information in name and date fields? You would think that a date field is actually of type date, but you would be surprised.
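A bit of data profiling along these lines can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the keyword list, the field names, and the expected `YYYY-MM-DD` format are made up for illustration, and a real profiling tool would check far more than this.

```python
import re
from datetime import datetime

# Assumed keywords hinting that a value is a delivery instruction, not a name.
DELIVERY_HINTS = re.compile(r"\b(buzzer|leave|door|porch|gate)\b", re.IGNORECASE)

def looks_like_delivery_note(value):
    """True if the text contains one of the assumed delivery keywords."""
    return bool(DELIVERY_HINTS.search(value))

def parses_as_date(value):
    """True if the text parses as a year-month-day date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def profile(records):
    """Return (row index, field, issue) tuples for suspicious values."""
    issues = []
    for i, rec in enumerate(records):
        if looks_like_delivery_note(rec.get("name", "")):
            issues.append((i, "name", "possible delivery instructions"))
        if not parses_as_date(rec.get("date_of_birth", "")):
            issues.append((i, "date_of_birth", "not a valid date"))
    return issues

# Hypothetical rows where every column is stored as text.
sample = [
    {"name": "Jane Doe", "date_of_birth": "1985-02-11"},
    {"name": "John Smith buzzer 204", "date_of_birth": "1979-13-40"},
    {"name": "Leave at back door", "date_of_birth": "unknown"},
]
issues = profile(sample)
for issue in issues:
    print(issue)
```

Note that the values flagged in the name field may be perfectly accurate. They are flagged because they sit in the wrong field, which is exactly the point above.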

4. Decay factor

For the fourth reason, you could actually have good quality data today, but that doesn’t mean it will retain its quality tomorrow. Back to the bio data example: if you’re storing the age of an individual and not their date of birth, and it’s not updated automatically, it will be incorrect after their next birthday. Just as I addressed in the second data quality myth video, even if you cleanse your data so that it is clean now, it won’t be tomorrow, as the quality of certain data decays just by sitting there.
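The age example can be made concrete with a short sketch. It assumes the date of birth is available and derives the age on demand; the function name `age_on` and the sample dates are hypothetical.

```python
from datetime import date

def age_on(date_of_birth, as_of):
    """Completed years between date_of_birth and as_of."""
    years = as_of.year - date_of_birth.year
    # Subtract one if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

dob = date(1990, 6, 15)

# An age captured once and stored is correct only until the next birthday.
stored_age = age_on(dob, date(2023, 6, 14))

# An age derived from the date of birth stays correct over time.
current_age = age_on(dob, date(2024, 6, 15))

print(stored_age, current_age)
```

Storing the stable fact (the date of birth) and deriving the volatile one (the age) removes this particular source of decay, although other attributes such as addresses or phone numbers still go stale on their own.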

5. Post-cleansing

Lastly, and in a way this ties back to the first reason, a lot of data quality issues are “fixed” in the ETL phase, before the data is output to reports, dashboards or other places where it is available for human consumption. The problem is that these data quality efforts do not also serve the system of record or system of origin. If there is any control over that data, then the data in these systems should be cleansed as well. Most of the time it is not, and when another data integration project comes along, the data quality efforts need to be replicated, though a lot of the time they are forgotten. The bad data then appears in these new environments because it was not cleansed at the source.

So, is your data clean or do you just think it is?

About the author

George Firican

George Firican is the Director of Data Governance and Business Intelligence at the University of British Columbia, which is ranked among the top 20 public universities in the world. His passion for data led him towards award-winning program implementations in the data governance, data quality, and business intelligence fields. Due to his desire for continuous improvement and knowledge sharing, he founded LightsOnData, a website which offers free templates, definitions, best practices, articles and other useful resources to help with data governance and data management questions and challenges. He also has over twelve years of project management and business/technical analysis experience in the higher education, fundraising, software and web development, and e-commerce industries.

