How to implement data profiling for source data discovery

Effective data source profiling for a data warehouse (DW) is an often overlooked step in DW data preparation. DW project teams need to understand all quality aspects of source data before preparing it for downstream consumption. Beyond simple visual examination, you need to profile, visualize, detect outliers, and find null values and other junk data in your source data sets.

The first purpose of this profiling analysis is to decide if the data source is even worth including in your project. As data warehouse guru Ralph Kimball writes in his book The Data Warehouse Toolkit, “Early disqualification of data sources is a responsible step that can earn you respect from the rest of the team, even when it seems to be bad news.”

If the data source is deemed worthy of inclusion, results from profiling this source will help you evaluate the data for overall quality and estimate the ETL work necessary to cleanse the data for downstream analysis.

A leading cause of setbacks during data warehouse planning and development is extracting erroneous or poor-quality source data as input to data warehouse ETLs.

Data discovery, data mappings & design, ETL development

Typical ETLs extract data from sources and load it to targets. Project teams that perform data discovery profiling on data sources before building ETL data mappings can expect to achieve the following:

  • A more accurate understanding of the data types, formats, and precision of each source
  • A more precise specification for the types of data transformations required to clean, de-duplicate, aggregate, and apply business rules
  • Discovery of source data anomalies and outliers that require further investigation and possible remediation before the data is migrated, leading to more robust error- and exception-handling processes (see the sketch after this list)
  • Identification of previously unknown source data that meets the business requirements and needs to be migrated – discovery can uncover data that was previously undefined or unobtainable for data migrations
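
For instance, here is a minimal sketch of the kind of outlier and null-value check described above, using pandas and the interquartile-range rule; the order_amount column and its values are hypothetical:

```python
import pandas as pd

# Hypothetical source extract; in practice this would come from the candidate source system.
orders = pd.DataFrame({"order_amount": [120.0, 95.5, 110.0, 102.5, 99.0, 15000.0, None]})

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; nulls are never flagged."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

orders["is_outlier"] = iqr_outliers(orders["order_amount"])
print(orders[orders["is_outlier"]])          # rows needing investigation before migration
print(orders["order_amount"].isna().sum())   # null counts feed the exception-handling design
```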

Profiling each candidate source will likely highlight information about the actual source data of which neither technical nor business resources were aware. However, both technical and business resources will be needed to understand and act on the profiling results. All of this profiling information is necessary to correctly detail the mapping and data transformation requirements for the source data (see Figure 1).
Both technical and business knowledge of the source and target data are critical to the success of DW projects. The need for multiple resources from distinct functional areas to coordinate each step in the DW/BI development life cycle is a challenging aspect of data integration development.

Figure 1: Profiling of all source data to provide metadata as input to data warehouse design

Implementing data profiling for successful source data discovery

Source data discovery represents an inventory and analysis of data from various “potential” sources to gain insight into obscure patterns and trends. It is the first step in fully harnessing and understanding an organization’s data to inform critical business decisions in response to a DW/BI project’s requirements.
All too often, after just a few meetings, business and data analysts begin developing ETL mappings; next, developers code the ETLs; then comes unit testing. Soon, however, the project team realizes that although the ETLs were written to the requirements specifications, the data loads don't “look right”.
After troubleshooting, they find the cause: source data that does not meet DW/BI requirements. The team might decide that data transformations and “data joins” will eliminate some of the discrepancies. In effect, they are performing two critical functions left out of the original development plan – 1) data discovery, including profiling, followed by 2) data enhancement, cleansing, and enrichment.
A well-planned data discovery process will result in a more efficient DW ETL design. You can profile your source data to discover structure, relationships and data rules. Data profiling provides statistical information about compliant data and outliers.
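
For example, a minimal pandas sketch along these lines (the extract file and its columns are hypothetical) surfaces structure, null counts, and cardinality for a candidate source:

```python
import pandas as pd

# Hypothetical flat-file extract from a candidate source system.
src = pd.read_csv("customer_extract.csv")

profile = pd.DataFrame({
    "dtype": src.dtypes.astype(str),         # inferred data type per column
    "non_null": src.notna().sum(),           # populated values per column
    "null_pct": src.isna().mean().round(3),  # share of missing values
    "distinct": src.nunique(dropna=True),    # cardinality per column
})
print(profile)
print(src.describe(include="all"))           # min/max/mean and most frequent values
```
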
Data discovery tools can overcome problems of scale because many of those tools can scan large environments and identify data in a fraction of the time required by a team of human analysts. Tools offer a much greater chance of finding the best sources of critical business data.
General steps to source data discovery include:

  1. Identify specific data domains needed to meet required business reporting
  2. Uncover potential internal and external sources of that data from enterprise applications that store, collect, or consume it
  3. Prioritize candidate source data for analysis and movement into a data warehouse by using (for example) data lake metadata
  4. Ensure that each source meets the privacy and regulatory requirements for its intended use (data classification can help here)
  5. Ensure that each source will be adequately available and accessible according to required frequencies
  6. Use data preparation tools (e.g., for profiling and cleansing) to perform portions of an overall refinement process, integrating them with other types of curation and preparation as part of an iterative approach to data refinement (a minimal sketch follows this list)
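
As a rough illustration of steps 3 and 6, the sketch below samples a few candidate tables and ranks them by volume and completeness so the team can prioritize what to analyze first; the connection string and table names are hypothetical, and the approach assumes pandas with SQLAlchemy:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection and table names; replace with real candidate sources.
engine = create_engine("postgresql://user:password@source-host/crm")
candidates = ["customers", "orders", "addresses"]

summaries = []
for table in candidates:
    # Sample each candidate rather than scanning it in full.
    df = pd.read_sql_query(f"SELECT * FROM {table} LIMIT 100000", engine)
    summaries.append({
        "table": table,
        "rows_sampled": len(df),
        "columns": df.shape[1],
        "avg_null_pct": round(df.isna().mean().mean(), 3),  # rough completeness signal
    })

# Rank candidates: more complete, larger tables first.
ranking = pd.DataFrame(summaries).sort_values(["avg_null_pct", "rows_sampled"],
                                              ascending=[True, False])
print(ranking)
```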

Table 1 lists and describes useful profiling tasks. Single-column profiling refers to the analysis of values in a single column, ranging from simple counts and aggregation functions to distribution analysis and the discovery of patterns and data types. Multi-column profiling extends these activities across columns, analyzing inter-value dependencies to uncover association rules, clusters, and outliers.

  1. Cardinalities
     • Num-rows – Number of rows
     • Value length – Measurements of value lengths (min., max., median, and average)
     • Null values – Number or percentage of null values
     • Distinct – Number of distinct values; sometimes called “cardinality”
     • Uniqueness – Number of distinct values divided by the number of rows

  2. Value distributions
     • Histogram – Frequency histograms (equi-width, equi-depth, etc.)
     • Constancy – Frequency of the most frequent value divided by the number of rows
     • Quartiles – Three points that divide the (numeric) values into four equal groups
     • First digit – Distribution of the first digit in numeric values

  3. Patterns, data types, domains
     • Basic type – Generic data type, such as numeric, alphabetic, alphanumeric, date, time
     • Data type – Concrete DBMS-specific data type, such as varchar, timestamp, etc.
     • Size – Maximum number of digits in numeric values
     • Decimals – Maximum number of decimal places in numeric values
     • Patterns – Expected value patterns (e.g., Aa9...)
     • Data class – Semantic, generic data type, such as code, indicator, text, date/time, quantity, identifier
     • Domain – Classification of the semantic domain, such as credit card, first name, city, phenotype

Table 1: Typical metadata gathered as a result of single-column data profiling.
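
As one possible illustration, several of the single-column measures in Table 1, plus a simple multi-column dependency check, could be computed with pandas as sketched below (the sample data and column names are hypothetical):

```python
import pandas as pd

# Hypothetical source sample with one numeric column and two related text columns.
src = pd.DataFrame({
    "unit_price": [19.99, 24.50, 19.99, None, 5.00, 19.99, 120.00],
    "zip_code":   ["10001", "10001", "94105", "94105", "10001", "60601", "60601"],
    "city":       ["New York", "New York", "San Francisco", "San Francisco",
                   "New York", "Chicago", "Chicago"],
})

col = src["unit_price"]
single_column = {
    "num_rows":    len(col),
    "null_pct":    col.isna().mean(),                         # null values
    "distinct":    col.nunique(),                             # cardinality
    "uniqueness":  col.nunique() / len(col),                  # distinct values / rows
    "constancy":   col.value_counts().iloc[0] / len(col),     # most frequent value / rows
    "quartiles":   col.quantile([0.25, 0.5, 0.75]).tolist(),  # splits values into four groups
    "first_digit": col.dropna().astype(str).str[0].value_counts(normalize=True).to_dict(),
}
print(single_column)

# Multi-column check: does zip_code functionally determine city (a candidate data rule)?
cities_per_zip = src.groupby("zip_code")["city"].nunique()
print(cities_per_zip[cities_per_zip > 1])   # zip codes mapped to multiple cities need review
```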

Table 2 presents a method for assessing the data quality observed during source data profiling (S. Juddoo, “Overview of Data Quality Challenges in the Context of Big Data”, ResearchGate.net, Dec. 2015).

  1. Accessibility – Extent to which data is available, or easily and quickly retrievable
  2. Suitable volume of data – Extent to which the volume of data is appropriate for the task at hand
  3. Believability – Extent to which data is regarded as true and credible
  4. Completeness – Extent to which data is not missing and is of sufficient breadth and depth for the task at hand; often measured as the number of NULL values in a column, where a NULL may mean that the value does not exist, that it exists but is unknown, or that it is not known whether a value exists
  5. Consistent representation – Extent to which data is presented in the same format
  6. Ease of manipulation – Extent to which data is easy to manipulate and apply to different tasks
  7. Free-of-error – Extent to which data is correct and reliable
  8. Interpretability – Extent to which data is expressed in appropriate languages, symbols, and units, with clear definitions
  9. Objectivity – Extent to which data is unbiased, unprejudiced, and impartial
  10. Relevancy – Extent to which data is applicable and helpful for the task at hand
  11. Reputation – Extent to which data is highly regarded in terms of its source and content
  12. Security – Extent to which access to data is restricted appropriately to maintain its security
  13. Timeliness – Extent to which data is sufficiently up to date
  14. Understandability – Extent to which data is easily comprehended
  15. Value-added – Extent to which data is beneficial and provides advantages from its use

Table 2: A method of measuring data quality for individual data source elements and attributes.
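
Most of the dimensions in Table 2 require business judgment, but a few can be scored directly from the data during profiling. Below is a minimal sketch that scores completeness and consistent representation; the extract file, the birth_date column, and its expected format are assumptions for illustration:

```python
import pandas as pd

# Hypothetical candidate source extract.
src = pd.read_csv("customer_extract.csv")

# Completeness (dimension 4): share of non-null values per column.
completeness = src.notna().mean()
print(completeness.sort_values())

# Consistent representation (dimension 5): share of values matching an expected pattern,
# here an assumed YYYY-MM-DD format for a hypothetical birth_date column.
consistency = (
    src["birth_date"].dropna().astype(str).str.match(r"^\d{4}-\d{2}-\d{2}$").mean()
)
print(f"birth_date format consistency: {consistency:.1%}")
```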

Conclusion

Because data profiling improves data quality, data management, and data governance, it is important for customers, data scientists, and DBAs to adopt data profiling processes and tools. Data is an asset for all users; therefore, its quality should be controlled by procedures, rules, people, and software.

Source data profiling is a crucial but often overlooked step in data preparation: the DW/BI team needs to fully understand source data before further preparing it for downstream consumption. Beyond simple visual examination, you need to profile, visualize, detect outliers, and find null values and other junk data in your data set.

The first purpose of source data profiling and analysis is to decide if candidate data sources are worthy of inclusion in your project. As data warehouse guru Ralph Kimball wrote in his book, The Data Warehouse Toolkit, “Early disqualification of a data source is a responsible step that can earn you respect from the rest of the team, even if it is bad news.”

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

About the author 

Wayne Yaddow

Wayne Yaddow is an independent consultant with more than 20 years’ experience leading data integration, data warehouse, and ETL testing projects with J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, and IBM. He taught IIST (International Institute of Software Testing) courses on data warehouse and ETL testing and wrote DW/BI articles for Better Software, The Data Warehouse Institute (TDWI), Tricentis, and others. Wayne continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at wyaddow@gmail.com.
