How learning to use Pareto analysis can improve your data quality

In the data quality line of work, it is very rewarding to get to the root cause of an issue and fix it, along with the data. More often than not, there is a long list of data quality issues to resolve, so how do you quickly prioritize what to tackle first? This could be a long topic to cover, but here is a quick overview of Pareto analysis and how learning it can help you make those decisions and improve your data quality.

Definition

A tool and technique for identifying the small share of causes (roughly 20%) that need to be addressed in order to resolve the majority (roughly 80%) of the problem.

Synonym(s):

Pareto chart, Pareto diagram, 80/20 rule

Description

Through the lens of data quality, the Pareto analysis essentially states that 80% of poor data quality is caused by 20% of the issues encountered along the data supply chain (acquisition, creation, transformation, maintenance, dissemination, retirement). Pareto charts are one of the seven basic tools of quality, i.e., the graphical techniques identified as being most helpful in troubleshooting quality-related issues.

Fun fact

The Pareto analysis was put into practice in the 1940s by Joseph Juran, a quality control pioneer. He showed that 80% of quality defects stemmed from 20% of the causes. Why is it called Pareto? Because in the early 1900s Vilfredo Pareto deduced, through extensive research, that 80% of the wealth of Italy was held by only 20% of its population. Over the years this principle was widely cited and used in economics, and later in quality and project management. Basically, the 80/20 rule says that 20% of inputs drive 80% of results.

When to use

To determine the frequency of causes for poor data quality
When there are many causes of poor data quality and you want to focus on the most significant ones
To prioritize which data quality issues to focus on first
To create a communication medium and visualize the 80/20 effect

Pros

A simple tool which does not require training
Helps set priorities for groups of data quality issues
Saves time and resources by concentrating your data quality improvement efforts where they have the most impact
Works well with other methods and techniques, such as the fishbone diagram

Cons

The result depends heavily on how issues are categorized and on the accuracy of that categorization
It only takes into account the number of occurrences, not necessarily the risks and costs associated with them, though there are variations which include these factors

Avoid losing track of data quality issues. Here is a free data quality issues log.

Steps to develop it

Classification of issues: For the identified undesired outcome (e.g., returned mail), create a list of causes and categorize them (e.g., customer returned the mail, incorrect address, incomplete address, deceased customer, etc.). This can be achieved through a fishbone diagram and/or focus group meetings, brainstorming sessions, surveys, etc.
Collect data: Decide on a time period over which data about this undesired outcome will be collected, and start tallying the number of occurrences in each of the categories identified. In the returned mail example, you can look at the return reason on the envelope as identified by the mail carrier or the mail recipient.
Create the Pareto graph: Sort your categories by the number of occurrences found in each one, from highest to lowest. For each category, calculate the cumulative percentage: the running total of occurrences up to and including that category, divided by the total number of occurrences. Create a combined graph with one horizontal axis for the categories and two vertical axes, one on each side of the chart. Scale the left vertical axis from zero up to the highest number of occurrences and plot the occurrences against it as bars. Scale the right vertical axis from zero to 100% and plot the cumulative percentage against it as a line. There are a few tutorials on how to do this in Excel, so I won’t recreate the steps here.
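If you would rather compute the numbers behind the chart yourself, the sorting and cumulative percentage steps can be sketched in a few lines of Python. This is only a minimal illustration, not a replacement for a charting tool: the function name `pareto_table` and the sample return reasons are made up for this example and are not real data.

```python
from collections import Counter
from itertools import accumulate


def pareto_table(observations):
    """Return rows of (category, count, cumulative_percent), largest count first."""
    counts = Counter(observations)
    # most_common() sorts the categories by descending number of occurrences.
    ranked = counts.most_common()
    total = sum(counts.values())
    # Running total of occurrences, in the same sorted order.
    running = accumulate(count for _, count in ranked)
    return [
        (category, count, round(100 * cumulative / total, 1))
        for (category, count), cumulative in zip(ranked, running)
    ]


# Hypothetical return reasons, one entry per returned envelope.
returned_mail = (
    ["incorrect address"] * 5
    + ["incomplete address"] * 3
    + ["deceased customer"] * 1
    + ["customer returned the mail"] * 1
)

for category, count, cumulative_pct in pareto_table(returned_mail):
    print(f"{category:<28}{count:>4}{cumulative_pct:>8.1f}%")
```

The count column is what you would plot as bars against the left axis, and the cumulative percentage column is what you would plot as a line against the right axis.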

Example

Here is a Pareto chart showing the 5 main causes of returned mail, sorted by the number of occurrences. You can quickly identify the 2 main causes, incorrect address format and incomplete address, as they lie to the left of the 80% cut-off mark. Resolving these two would address roughly 80% of the problem.

[Figure: Pareto chart of the 5 main causes of returned mail]
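The same reading of the 80% cut-off can be sketched in code. The snippet below is a rough illustration only: the function name `vital_few`, the counts, and the "other" category are hypothetical and chosen just so that the first two causes reach the 80% mark, as in the example above.

```python
def vital_few(counts, threshold_percent=80):
    """Return the top categories needed to reach the cumulative threshold."""
    total = sum(counts.values())
    selected, running = [], 0
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for category, count in ranked:
        # Stop once the categories already selected cover the threshold.
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if running * 100 >= threshold_percent * total:
            break
        selected.append(category)
        running += count
    return selected


# Hypothetical counts for five causes of returned mail (not real data).
causes = {
    "incorrect address format": 50,
    "incomplete address": 30,
    "deceased customer": 10,
    "customer returned the mail": 6,
    "other": 4,
}

print(vital_few(causes))
```

Lowering or raising `threshold_percent` is an easy way to see how sensitive the list of priorities is to the cut-off you pick.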

Tips

Use this early in your data quality improvement process
Use it to build consensus if there is uncertainty or disagreement around what the data quality improvement priorities should be
The 80/20 ratio is merely a convenient rule of thumb and you should not consider it an immutable law of nature

Tools

Microsoft Excel or Power BI
Tableau Software
Other data visualization tools

A variation of the Pareto analysis is to look at the cost and/or risk of each occurrence, not just the number of occurrences. This can help you identify the data quality issues that are causing the most expensive or highest-risk problems.
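As a rough sketch of this variation, you can multiply each cause's number of occurrences by an estimated cost per occurrence and rank by the result. Everything in the snippet is an assumption made for illustration: the function name `weighted_pareto`, the counts, and the unit costs are invented, and a real analysis would need cost or risk estimates agreed on with the business.

```python
def weighted_pareto(occurrences, cost_per_occurrence):
    """Rank causes by total cost (count x unit cost) with cumulative percentages."""
    impact = {
        cause: count * cost_per_occurrence[cause]
        for cause, count in occurrences.items()
    }
    total = sum(impact.values())
    if total == 0:
        return []
    rows, running = [], 0
    for cause, cost in sorted(impact.items(), key=lambda item: item[1], reverse=True):
        running += cost
        rows.append((cause, cost, round(100 * running / total, 1)))
    return rows


# Hypothetical numbers: occurrences and an assumed cost per occurrence.
occurrences = {"incorrect address": 50, "incomplete address": 30, "deceased customer": 10}
unit_cost = {"incorrect address": 2, "incomplete address": 2, "deceased customer": 25}

for cause, cost, cumulative_pct in weighted_pareto(occurrences, unit_cost):
    print(f"{cause:<22}{cost:>6}{cumulative_pct:>8.1f}%")
```

With these made-up numbers the least frequent cause moves to the top of the ranking, which is exactly the kind of shift a count-only Pareto analysis would miss.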

About the author

George Firican

George Firican is the Director of Data Governance and Business Intelligence at the University of British Columbia, which is ranked among the top 20 public universities in the world. His passion for data led him towards award-winning program implementations in the data governance, data quality, and business intelligence fields. Due to his desire for continuous improvement and knowledge sharing, he founded LightsOnData, a website which offers free templates, definitions, best practices, articles and other useful resources to help with data governance and data management questions and challenges. He also has over twelve years of project management and business/technical analysis experience in the higher education, fundraising, software and web development, and e-commerce industries.
