One of my favourite quality stories is this: just as W. Edwards Deming was about to begin a 4-day seminar on improving quality at a large enterprise, one of the executives scheduled to take the course came up to him and said that he was so busy that week that he could not possibly spend 4 days at the seminar. He asked Deming to distill the main message of the seminar into a few words so that he didn’t have to ‘waste’ 4 days. In the story, Deming smiled and said, “You should focus on reducing variation”.
The amazing thing to me about this story is that those 2 words do such a great job of distilling the ideas behind what Deming called ‘profound knowledge’. Today, that ‘reduce variation’ advice is still as applicable to the manufacture of cars on an assembly line as it is to the creation and management of policyholder records in an insurance company’s database. Specifically, Deming would tell the car manufacturer to reduce variation in all the parts and assembly processes that go into making a car that customers want to buy, just as he would tell a life insurance executive to measure, analyze, improve and control the data and processes involved in the creation of a life insurance policy record in a database.
When I teach my 6-week course on Data Quality Improvement at BCIT, I try to get the students thinking about how the processes involved in the production of data can be continuously improved. With only 6 weeks, the focus is on teaching them to use a simple data profiling tool to quickly expose possible data quality issues, and then to evolve those DQ issues into Data Quality Rules (DQRs) using simple Excel charts, such as Statistical Process Control and Process Behaviour Charts, that show how the data quality issues vary over time.
The Data Profile of a data set from a Business Process provides the characteristics of each column in the data set, as well as the ‘behaviour’ of that column’s data over time. For each column, the profile illustrates basic quality characteristics such as average, variance, patterns, masks, distributions, and outliers. Using these, the DQ Analyst can prepare a short list of ‘interesting characteristics’ about the data set and ask a subject matter expert, i.e. a person familiar with the business process and its outputs, to explain the significance and priority of each Poor Data Quality (PDQ) characteristic. For example, this image shows that the Credit Card numbers, when entered, are consistently 16 digits except for 5 incorrect patterns.
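To make the pattern/mask idea concrete, here is a minimal sketch in Python of how a profiling step might reduce a column’s values to pattern masks and count them. This is not the profiling tool used in the course; the column name and sample values are assumptions for illustration only.

```python
from collections import Counter

def mask(value: str) -> str:
    """Reduce a raw value to a pattern mask: digits -> '9', letters -> 'A', other characters kept."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def profile_patterns(values):
    """Count how often each pattern mask occurs in a column."""
    return Counter(mask(v) for v in values)

# Hypothetical sample of a CreditCardNumber column
sample = ["4539876512349876", "4539 8765 1234 9876", "453987651234", "4539876512349876"]

for pattern, count in profile_patterns(sample).most_common():
    print(f"{pattern!r}: {count}")
```

A profile like this makes the dominant 16-digit pattern obvious and surfaces the stray masks for the subject matter expert to assess.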
With the prioritization(s) in hand, the DQ Analyst can evolve the significant DQ issues into DQ Rules by analyzing the occurrences of each DQ issue over time to calculate the average number of occurrences per period and the variation around that average. The average and variation define the upper and lower occurrence thresholds, and when the DQ Rule violation counts are charted over time, the graph almost magically shows what Donald Wheeler calls the ‘Voice of the Process’.
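As a sketch of how the average and variation become thresholds, the snippet below follows Wheeler’s XmR (individuals and moving range) convention, where the natural process limits are the mean plus or minus 2.66 times the average moving range. The monthly violation counts are made-up numbers, not the actual DQR data.

```python
def process_limits(counts):
    """Compute the centre line and natural process limits for an XmR chart.

    Limits follow Wheeler's convention: mean +/- 2.66 * average moving range.
    """
    mean = sum(counts) / len(counts)
    moving_ranges = [abs(b - a) for a, b in zip(counts, counts[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    upper = mean + 2.66 * mr_bar
    lower = max(0.0, mean - 2.66 * mr_bar)  # a violation count cannot fall below zero
    return mean, lower, upper

# Hypothetical monthly 'no address' violation counts
monthly_counts = [160, 172, 195, 158, 163, 170, 155, 168, 161, 159, 173, 166]
mean, lower, upper = process_limits(monthly_counts)
print(f"centre line: {mean:.1f}, natural process limits: [{lower:.1f}, {upper:.1f}]")
```

Plotting the counts against the centre line and limits gives the Process Behaviour Chart that lets the ‘Voice of the Process’ be heard.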
Here, DQR #1723 is run against the data at the end of each month and the number of ‘no address’ DQR violations is recorded. In this chart you can see that an average of 165 customer records without an address have been created each month, and though there was a major variation in March 2014, overall this DQ issue is stable.
Since the past is the best predictor of the future, a person wanting to know how many customer records will be entered into the system without addresses next month could safely estimate 165, plus or minus 15. Now that it is aware, the organization can decide that the ‘no address’ issue is significant and begin to continuously improve the upstream processes to ensure customer addresses are entered. In this chart of 2015-16 we can see that the number of ‘no address’ records has dropped to an average of 140 per month, with steady improvement until the last month of each year.
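Continuing the sketch, baseline limits like those above can be used to decide whether a new month’s count is routine noise or a signal worth investigating. The limit values and monthly counts here are illustrative assumptions, not the actual DQR #1723 figures.

```python
def classify_month(count, lower, upper):
    """Flag a month's violation count as noise (within limits) or a signal."""
    if count > upper:
        return "signal: unusually many violations"
    if count < lower:
        return "signal: unusually few violations (possible improvement)"
    return "noise: within the voice of the process"

# Assumed baseline limits and a few hypothetical months
for month, count in [("Jan", 162), ("Feb", 210), ("Mar", 118)]:
    print(month, classify_month(count, lower=135.0, upper=195.0))
```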
To quote Donald Wheeler:
“The characterization of a process as either predictable or unpredictable is a fundamental dichotomy for data analysis”.
Simply put, if the process is not predictable, how can you manage it? Once you measure the baseline variability of the data set generated by the business process, you can go further and measure the variability of the DQ characteristics that have a significant impact on the integrity and usability of the business process information. With the added depth of baseline measurements of the ‘noise’ in the DQ Rule violations, your organization’s decision making will steadily improve.