9 questions to ask for data veracity assessment

There are different takes on what veracity refers to. The overall consensus is that data veracity reflects the truthfulness of a data set and your level of confidence or trust in it. I’ll take this a step further and say that data veracity is your level of confidence or trust in the data based on its provenance as well as on how it was processed.

Think about this: when you get a box of chocolates which you haven’t tried before, how do you estimate how good it is? The first step is to look at where it was made and by which shop or brand. You can mainly assess its quality by its provenance. As a second step, you probably also want to ensure that after you open the box, you won’t taint the chocolates somehow before you taste them.

Data veracity helps us better understand the risks associated with analysis and business decisions based on a particular big data set.

Looking at a data example, imagine you want to enrich your sales prospect information with employment data: where those customers work and what their job titles are. Not only can this provide you with additional contact data, but it can also help you create different market segments and do a better job of serving them.

LinkedIn collects lots of employment data, but unfortunately you can’t purchase it from them. So what can you do? You might go to a third-party provider who claims to scrape LinkedIn data from search engine results (a legal grey area in my opinion, at least at the time of writing; I’m not a legal expert, so let’s just treat this as a theoretical example). You might consider purchasing this LinkedIn employment data, but how do you gauge its veracity?

Well, here are the 9 questions to ask the data provider to help you better assess the data veracity:

Who created the original data source?
Who contributed to the data source?
When was the data collected?
Was the original data source enriched in any way?
What methodology did they follow in collecting the data?
What algorithm did they use to match records and what are the matching confidence levels?
Were only certain industries or locations included in the data source?
Has the information been edited or modified in any way?
Did the creators summarize the information?
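
The question about matching algorithms and confidence levels is easier to discuss with a provider if you have a concrete picture of what a matching confidence level can look like. The following is a minimal sketch, not the method any particular provider uses: it assumes simple fuzzy string matching with Python’s standard difflib module, and the function names and the 0.85 threshold are hypothetical choices made for illustration only.

```python
from difflib import SequenceMatcher


def match_confidence(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two names.

    Names are lowercased and extra whitespace is collapsed before comparing.
    """
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())

    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(prospect_name, vendor_names, threshold=0.85):
    """Return (vendor_name, score) for the closest vendor record.

    If no candidate reaches the (hypothetical) threshold, the name is None
    so the record can be routed to manual review instead of being merged.
    """
    if not vendor_names:
        return (None, 0.0)
    score, name = max((match_confidence(prospect_name, v), v) for v in vendor_names)
    return (name, score) if score >= threshold else (None, score)
```

A provider’s real matching logic is likely to use more fields than a name (employer, location, email), so treat the score here only as an illustration of why you should ask how the confidence level is computed and what cut-off was applied.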

Once you have answers to all these questions, you will also need to understand how, where, and when you will integrate this data with your own. Which definitions, extract, transform, and load (ETL) procedures, and business rules will you follow?

Answers to these questions are necessary to determine the veracity of this big data source. To expand on the employment data example, what if your customer base only included lawyers? Well, then you wouldn’t choose LinkedIn as your data source but rather go to the American and/or Canadian Bar Association. Why? Because the bar associations’ data has higher veracity for this purpose than data that is self-reported.

Veracity is impacted by human bias and error, lack of data governance and data validation, software bugs which can lead to duplication and variability, volatility, and lack of security. We all wish for these to be addressed as we consider them important, at least in theory, but the reality is that not all data vendors monitor these variables enough to fully address them and follow the trifecta of data quality management. That’s probably why IBM Big Data & Analytics Hub estimates that poor-quality data costs the US economy $3.1 trillion every year.
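
Two of the issues listed above, duplication and missing validation, can be checked on a sample before you commit to a data source. Here is a minimal sketch under stated assumptions: records are plain dictionaries with string values, and the function name and the field names in the usage note are hypothetical.

```python
from collections import Counter


def profile_records(records, key_fields, required_fields):
    """Count duplicate records and empty required fields in a sample.

    Two records count as duplicates when all key fields are equal after
    trimming whitespace and lowercasing. Values are assumed to be strings.
    """
    keys = [
        tuple((r.get(f) or "").strip().lower() for f in key_fields)
        for r in records
    ]
    counts = Counter(keys)
    # Every occurrence beyond the first of a key is one duplicate.
    duplicates = sum(c - 1 for c in counts.values() if c > 1)
    missing = {
        f: sum(1 for r in records if not (r.get(f) or "").strip())
        for f in required_fields
    }
    return {"total": len(records), "duplicates": duplicates, "missing": missing}
```

For example, calling profile_records(sample, key_fields=("name", "employer"), required_fields=("employer", "title")) on a sample file from the provider gives a quick, rough signal of how much validation the data has had. It does not replace the questions above.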

Veracity is rarely achieved in big data due to its high volume, velocity, variety, variability, and overall complexity. In turn, we take solace in understanding that knowledge of data’s veracity helps us better understand the risks associated with analysis and business decisions based on a particular big data set. So, find out as much as possible about your data sources, big and small, to better gauge the veracity.

A similar version of this article was originally published for ExagoBI.


About the author

George Firican

George Firican is the Director of Data Governance and Business Intelligence at the University of British Columbia, which is ranked among the top 20 public universities in the world. His passion for data led him towards award-winning program implementations in the data governance, data quality, and business intelligence fields. Due to his desire for continuous improvement and knowledge sharing, he founded LightsOnData, a website which offers free templates, definitions, best practices, articles and other useful resources to help with data governance and data management questions and challenges. He also has over twelve years of project management and business/technical analysis experience in the higher education, fundraising, software and web development, and e-commerce industries.
