In data quality work, it is very rewarding to get to the root cause of an issue and fix it, along with the data. More often than not, there is a long list of data quality issues to resolve, but how do you quickly prioritize what to tackle first? This could be a long topic, but here is a quick overview of Pareto analysis and how it can help you make those decisions and improve your data quality.
Definition
A tool and technique that helps identify the small portion of causes (roughly 20%) that must be addressed in order to resolve the majority (roughly 80%) of a problem.
Synonym(s):
Pareto chart, Pareto diagram, 80/20 rule
Description
Through the lens of data quality, Pareto analysis essentially states that 80% of poor data quality is caused by 20% of the issues encountered across the data supply chain (acquisition, creation, transformation, maintenance, dissemination, retirement). Pareto charts are one of the seven basic tools of quality, i.e. the graphical techniques identified as most helpful in troubleshooting quality-related issues.
Fun fact
Pareto analysis was put into practice in the 1940s by Joseph Juran, a quality control pioneer. He showed that 80% of quality defects stemmed from 20% of the causes. Why is it called Pareto? Because in the early 1900s, Vilfredo Pareto, through extensive research, deduced that 80% of Italy's wealth was held by just 20% of its population. Over the years this principle was widely cited and used in economics, and later in quality and project management. In short, the 80/20 rule says that 20% of inputs drive 80% of results.
When to use
- To determine the frequency of causes for poor data quality
- When there are many causes of poor data quality and you want to focus on the most significant ones
- To prioritize which data quality issues to focus on first
- To create a communication medium and visualize the 80/20 effect
Pros
- A simple tool that does not require training
- Helps set priorities for groups of data quality issues
- Saves time and resources by concentrating your data quality improvement efforts where they have the most impact
- Works well with other methods and techniques, such as the fishbone diagram
Cons
- Results are highly dependent on how issues are categorized and on the accuracy of that categorization
- It only takes occurrences into account, not necessarily the risks and costs associated with them – though there are variations that include these factors
Steps to develop it
- Classification of issues: For the identified undesired outcome (e.g., returned mail), create a list of causes and categorize them (e.g., customer returned the mail, incorrect address, incomplete address, deceased customer, etc.). This can be achieved through a fishbone diagram and/or focus group meetings, brainstorming sessions, surveys, etc.
- Collect data: Decide on a time period over which data about this undesired outcome will be collected, then tally the number of occurrences in each of the identified categories. In the returned mail example, you can look at the return reason noted on the envelope by the mail carrier or the recipient.
- Create the Pareto graph: Sort your categories by the number of occurrences in each one, then compute a cumulative percentage for each category. Create a combined chart with one horizontal axis and two vertical axes, one on either side. On the left vertical axis, plot the number of occurrences, with increments from zero up to the highest count. On the right vertical axis, plot the cumulative percentage, from 0% to 100%. There are several tutorials on how to do this in Excel, and a scripted sketch follows these steps below.
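If you prefer to script it, here is a minimal sketch of these three steps in Python with pandas and matplotlib. The category names follow the returned-mail example; the counts (and the fifth category, "Customer moved") are made up for illustration.

```python
# A minimal sketch of the three steps above. The counts are illustrative,
# not real returned-mail data.
import pandas as pd
import matplotlib.pyplot as plt

# Steps 1-2: occurrence counts per cause, collected over the chosen period
counts = pd.Series(
    {
        "Incorrect address format": 412,
        "Incomplete address": 287,
        "Customer moved": 96,
        "Deceased customer": 41,
        "Customer returned the mail": 23,
    }
)

# Step 3: sort descending and compute the cumulative percentage
counts = counts.sort_values(ascending=False)
cum_pct = counts.cumsum() / counts.sum() * 100

# Combined chart: bars on the left axis, cumulative % line on the right
fig, ax_counts = plt.subplots()
ax_counts.bar(counts.index, counts.values)
ax_counts.set_ylabel("Occurrences")
ax_counts.tick_params(axis="x", rotation=45)

ax_pct = ax_counts.twinx()
ax_pct.plot(counts.index, cum_pct.values, color="red", marker="o")
ax_pct.set_ylim(0, 100)
ax_pct.set_ylabel("Cumulative %")
ax_pct.axhline(80, linestyle="--", color="gray")  # the 80% cut-off line

plt.tight_layout()
plt.show()
```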
Example
Here is a Pareto analysis showing the number of occurrences for the five main causes of returned mail, sorted in descending order. You can quickly identify the two main causes, incorrect address format and incomplete address, as they lie to the left of the 80% cut-off mark. Resolving them will address roughly 80% of the problem.
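The same cut-off can be read programmatically. Continuing with the illustrative counts from the sketch above, this picks out the "vital few" causes needed to reach the 80% mark:

```python
# Keep every cause whose cumulative share *before* adding it is still
# under 80%, i.e. the causes needed to reach the 80% cut-off.
cum_pct = counts.cumsum() / counts.sum() * 100  # counts already sorted
vital_few = cum_pct[cum_pct.shift(fill_value=0) < 80].index.tolist()
print(vital_few)  # ['Incorrect address format', 'Incomplete address']
```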
Tips
- Use this early in your data quality improvement process
- Use it to build consensus if there is uncertainty or disagreement around what the data quality improvement priorities should be
- The 80/20 ratio is merely a convenient rule of thumb and you should not consider it an immutable law of nature
Tools
- Microsoft Excel or PowerBI
- Tableau Software
- Other data visualization tools
A variation of Pareto analysis is to look at the cost and/or risk of each occurrence, not just the number of occurrences. This can help you identify the data quality issues that cause the most expensive or highest-risk problems.
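As a sketch of this variation, continuing from the earlier snippets: weight each category by an estimated cost per occurrence and rank by total cost instead of raw counts. The per-occurrence costs below are assumptions for illustration only; with real figures, the ordering, and therefore your priorities, can change.

```python
# Variation: rank by estimated total cost rather than occurrence count.
# The per-occurrence costs are illustrative assumptions, not real data.
cost_per_occurrence = pd.Series(
    {
        "Incorrect address format": 1.20,  # reprint and repost
        "Incomplete address": 1.20,
        "Customer returned the mail": 1.20,
        "Deceased customer": 9.00,         # compliance review, record update
        "Customer moved": 4.50,            # manual trace-and-update
    }
)
total_cost = (counts * cost_per_occurrence).sort_values(ascending=False)
cum_cost_pct = total_cost.cumsum() / total_cost.sum() * 100
print(cum_cost_pct.round(1))  # causes may rank differently than by count
```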