How to implement data profiling for successful source data discovery
Effective Data Warehoues (DW) data source profiling is often an overlooked step in data warehouse data preparation. DW project teams need to understand all quality aspects of source data before preparation for downstream consumption. Beyond simple visual examination, you need to profile, visualize, detect outliers, and find null values and other junk data in your source data sets.
The first purpose of this profiling analysis is to decide if the data source is even worth including in your project. As data warehouse guru Ralph Kimball writes in his book The Data Warehouse Toolkit, “Early disqualification of data sources is a responsible step that can earn you respect from the rest of the team, even when it seems to be bad news.”
If the data source is deemed worthy of inclusion, results from profiling this source will help you evaluate the data for overall quality and estimate the ETL work necessary to cleanse the data for downstream analysis.
A leading cause of data warehousing barriers during planning and development is extracting erroneous or poor quality source data as input to data warehouse ETLs.
Data discovery, data mappings & design, ETL development
Typical ETLs extract data from sources and loads it to targets. Project teams that perform data discovery profiling on data sources before building ETL data mappings can expect to achieve the following:
- A more accurate understanding of the data types, formats, and precision of each source
- A more precise specification for the types of data transformations required to clean, de-duplicate, aggregate, and apply business rules
- Discovery of source data anomalies and outliers which require further investigation for possible remediation prior to the migration of data – this, leading to a more robust error handling and exception handling process
- Identification of source data, previously unknown, that meets the business requirements and needs to be migrated – discovery can lead to uncovering data that was previously undefined or unobtainable for data migrations
Profiling each source candidate will likely highlight information about the actual source data of which neither technical nor business resources were aware. However, analysis of profiling results will require both technical and business resources to understand and act on source data profiling results. All profiling information will be necessary in order to correctly detail the correct mapping and data transformation requirements for the source data (see Figure 1).
Both technical and business knowledge of the source and target data are critical to the success of DW projects. The need for multiple resources from distinct functional areas to coordinate each step in the DW/BI development life cycle is a challenging aspect of data integration development.
Implementing data profiling for successful source data discovery
Source data discovery represents an inventory and analysis of data from various “potential” sources to gain insight into obscure patterns and trends. It is the first step in fully harnessing and understanding an organization’s data to inform critical business decisions in response to a DW/BI project’s requirements.
All too often, after just a few meetings, business/data analysts begin developing ETL mappings — next, developers code ETLs; then, unit testing. Soon, however, the project team realizes that although ETL’s were written to requirements specifications, data loads don’t “look right”.
After troubleshooting they find the cause is source data that does not meet DW/BI requirements. The team might decide that data transformations and “data joins” will eliminate some of the discrepancies. In effect, they are performing two critical functions left out of the original development plan – 1) data discovery to include profiling followed by 2) data enhancement / cleansing / enrichment.
A well-planned data discovery process will result in a more efficient DW ETL design. You can profile your source data to discover structure, relationships and data rules. Data profiling provides statistical information about compliant data and outliers.
Data discovery tools can overcome problems of scale because many of those tools can scan large environments and identify data in a fraction of the time required by a team of human analysts. Tools offer a much greater chance of finding the best sources of critical business data.
General steps to source data discovery include:
- Identify specific data domains needed to meet required business reporting
- Uncover potential internal and external sources of that data from enterprise applications that store, collect, or consume address data
- Prioritize candidate source data for analysis and movement into a data warehouse by using (for example) data lake metadata
- Ensure that each source meets the privacy and regulatory requirements to be used for your purpose – Checkout the benefits of data classification
- Ensure that each source will be adequately available and accessible according to required frequencies
- Use data preparation (ex. profiling and cleansing) tools to perform portions of an overall refinement process, by integrating with other types of curation and preparation as part of an iterative approach to data refinement
Table 1 lists and describes useful profiling tasks. Single column profiling refers to the analysis of values in a single column ranging from simple counts and aggregation functions to distribution analysis and discovery of patterns and data types. Multi-column profiling is a set of activities that can be applied to a single column but allows for the analysis of inter-value dependencies across columns, resulting in association rules, clustering and outlier detection.
• Number of rows
• Frequency histograms (equi-width, equi-depth, etc.)
Patterns, data types, domains
• Basic type
• Generic data type, such as numeric, alphabetic, alphanumeric, date, time
Table 1: Typical metadata gathered as a result of single-column data profiling.
Table 2 presents a method for assessing the level of data quality experienced during source data profiling. (S. Juddoo, “Overview of Data Quality Challenges in the Context of Big Data”, ResearchGate.net, Dec. 2015)
Assessing Data Quality
Extent to which data is available, or easily and quickly retrievable
Suitable volume of data
Extent to which volume of data is appropriate for the task at hand
Extent to which data is regarded as true and credible
Extent to which data is not missing and is of sufficient breadth and depth for the task at hand
Extent to which data is presented in the same format
Ease of manipulation
Extent to which data is easy to manipulate and apply to different tasks
Extent to which data is correct and reliable
Extent to which data is in appropriate languages, symbols and units, and the definitions are clear
Extent to which data is unbiased, unprejudiced and impartial
Extent to which data is applicable and helpful for the task at hand
Extent to which data is highly regarded in terms of its source and content
Extent to which access to data is restricted appropriately to maintain its security
Extent to which data is sufficiently up-to-date
Extent to which data is easily comprehended
Extent to which data is beneficial and provides advantages from its uses
Table 2: A method of measuring data quality for individual data source elements and attributes.
As data profiling leads to the improvement of data quality, data management, and data governance, it is important for customers, data scientists, and DBAs to use data profiling processes and tools. Data is an asset for all users, therefore, data quality should be controlled by procedures, rules, people, and software.
This is the crucial but often overlooked step in data preparation: The DW/BI team needs to fully understand source data before further preparing it for downstream consumption. Beyond simple visual examination, you need to profile, visualize, detect outliers, and find null values and other junk data in your data set.
The first purpose of source data profiling and analysis is to decide if candidate data sources are worthy of inclusion in your project. As data warehouse guru Ralph Kimball wrote in his book, The Data Warehouse Toolkit, “Early disqualification of a data source is a responsible step that can earn you respect from the rest of the team, even if it is bad news.”