How to implement data profiling for source data discovery

Effective data source profiling for a data warehouse (DW) is an often overlooked step in DW data preparation. DW project teams need to understand all quality aspects of source data before preparing it for downstream consumption. Beyond simple visual examination, you need to profile, visualize, detect outliers, and find null values and other junk data in your source data sets.

The first purpose of this profiling analysis is to decide if the data source is even worth including in your project. As data warehouse guru Ralph Kimball writes in his book The Data Warehouse Toolkit, “Early disqualification of data sources is a responsible step that can earn you respect from the rest of the team, even when it seems to be bad news.”

If the data source is deemed worthy of inclusion, results from profiling this source will help you evaluate the data for overall quality and estimate the ETL work necessary to cleanse the data for downstream analysis.

A leading cause of setbacks during data warehouse planning and development is extracting erroneous or poor-quality source data as input to data warehouse ETLs.

Data discovery, data mappings & design, ETL development

Typical ETLs extract data from sources and load it to targets. Project teams that perform data discovery profiling on data sources before building ETL data mappings can expect to achieve the following:

  • A more accurate understanding of the data types, formats, and precision of each source
  • A more precise specification for the types of data transformations required to clean, de-duplicate, aggregate, and apply business rules
  • Discovery of source data anomalies and outliers that require further investigation and possible remediation before the data is migrated, leading to more robust error- and exception-handling processes (see the sketch after this list)
  • Identification of previously unknown source data that meets the business requirements and needs to be migrated – discovery can uncover data that was previously undefined or unobtainable for data migrations
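
For instance, here is a minimal sketch of the kind of outlier and null-value check described above, using pandas and the interquartile-range rule; the order_amount column and its values are hypothetical:

```python
import pandas as pd

# Hypothetical source extract; in practice this would come from the candidate source system.
orders = pd.DataFrame({"order_amount": [120.0, 95.5, 110.0, 102.5, 99.0, 15000.0, None]})

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; nulls are never flagged."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

orders["is_outlier"] = iqr_outliers(orders["order_amount"])
print(orders[orders["is_outlier"]])          # rows needing investigation before migration
print(orders["order_amount"].isna().sum())   # null counts feed the exception-handling design
```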

Profiling each candidate source will likely highlight information about the actual source data of which neither technical nor business resources were aware. However, both technical and business resources will be needed to understand and act on the profiling results. All of this profiling information is necessary to correctly detail the mapping and data transformation requirements for the source data (see Figure 1).
Both technical and business knowledge of the source and target data are critical to the success of DW projects. The need for multiple resources from distinct functional areas to coordinate each step in the DW/BI development life cycle is a challenging aspect of data integration development.

Figure 1: Profiling of all source data to provide metadata as input to data warehouse design

Implementing data profiling for successful source data discovery

Source data discovery represents an inventory and analysis of data from various “potential” sources to gain insight into obscure patterns and trends. It is the first step in fully harnessing and understanding an organization’s data to inform critical business decisions in response to a DW/BI project’s requirements.
All too often, after just a few meetings, business and data analysts begin developing ETL mappings; next, developers code the ETLs; then comes unit testing. Soon, however, the project team realizes that although the ETLs were written to the requirements specifications, the data loads don't “look right”.
After troubleshooting, they find the cause: source data that does not meet DW/BI requirements. The team might decide that data transformations and “data joins” will eliminate some of the discrepancies. In effect, they are performing two critical functions left out of the original development plan – 1) data discovery, including profiling, followed by 2) data enhancement, cleansing, and enrichment.
A well-planned data discovery process will result in a more efficient DW ETL design. You can profile your source data to discover structure, relationships and data rules. Data profiling provides statistical information about compliant data and outliers.
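
For example, a minimal pandas sketch along these lines (the extract file and its columns are hypothetical) surfaces structure, null counts, and cardinality for a candidate source:

```python
import pandas as pd

# Hypothetical flat-file extract from a candidate source system.
src = pd.read_csv("customer_extract.csv")

profile = pd.DataFrame({
    "dtype": src.dtypes.astype(str),         # inferred data type per column
    "non_null": src.notna().sum(),           # populated values per column
    "null_pct": src.isna().mean().round(3),  # share of missing values
    "distinct": src.nunique(dropna=True),    # cardinality per column
})
print(profile)
print(src.describe(include="all"))           # min/max/mean and most frequent values
```
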
Data discovery tools can overcome problems of scale because many of those tools can scan large environments and identify data in a fraction of the time required by a team of human analysts. Tools offer a much greater chance of finding the best sources of critical business data.
General steps to source data discovery include:

  1. Identify specific data domains needed to meet required business reporting
  2. Uncover potential internal and external sources of that data from enterprise applications that store, collect, or consume it
  3. Prioritize candidate source data for analysis and movement into a data warehouse by using (for example) data lake metadata
  4. Ensure that each source meets the privacy and regulatory requirements for its intended use (data classification can help here)
  5. Ensure that each source will be adequately available and accessible according to required frequencies
  6. Use data preparation tools (e.g., for profiling and cleansing) to perform portions of an overall refinement process, integrating them with other types of curation and preparation as part of an iterative approach to data refinement (a minimal sketch follows this list)
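
As a rough illustration of steps 3 and 6, the sketch below samples a few candidate tables and ranks them by volume and completeness so the team can prioritize what to analyze first; the connection string and table names are hypothetical, and the approach assumes pandas with SQLAlchemy:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection and table names; replace with real candidate sources.
engine = create_engine("postgresql://user:password@source-host/crm")
candidates = ["customers", "orders", "addresses"]

summaries = []
for table in candidates:
    # Sample each candidate rather than scanning it in full.
    df = pd.read_sql_query(f"SELECT * FROM {table} LIMIT 100000", engine)
    summaries.append({
        "table": table,
        "rows_sampled": len(df),
        "columns": df.shape[1],
        "avg_null_pct": round(df.isna().mean().mean(), 3),  # rough completeness signal
    })

# Rank candidates: more complete, larger tables first.
ranking = pd.DataFrame(summaries).sort_values(["avg_null_pct", "rows_sampled"],
                                              ascending=[True, False])
print(ranking)
```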

Table 1 lists and describes useful profiling tasks. Single-column profiling refers to the analysis of values in a single column, ranging from simple counts and aggregation functions to distribution analysis and the discovery of patterns and data types. Multi-column profiling extends these activities across columns, analyzing inter-value dependencies to uncover association rules, clusters, and outliers.

  1. Cardinalities
     • Num-rows – Number of rows
     • Value length – Measurements of value lengths (min., max., median, and average)
     • Null values – Number or percentage of null values
     • Distinct – Number of distinct values; sometimes called “cardinality”
     • Uniqueness – Number of distinct values divided by the number of rows

  2. Value distributions
     • Histogram – Frequency histograms (equi-width, equi-depth, etc.)
     • Constancy – Frequency of the most frequent value divided by the number of rows
     • Quartiles – Three points that divide the (numeric) values into four equal groups
     • First digit – Distribution of the first digit in numeric values

  3. Patterns, data types, domains
     • Basic type – Generic data type, such as numeric, alphabetic, alphanumeric, date, time
     • Data type – Concrete DBMS-specific data type, such as varchar, timestamp, etc.
     • Size – Maximum number of digits in numeric values
     • Decimals – Maximum number of decimal places in numeric values
     • Patterns – Expected value patterns (e.g., Aa9...)
     • Data class – Semantic, generic data type, such as code, indicator, text, date/time, quantity, identifier
     • Domain – Classification of the semantic domain, such as credit card, first name, city, phenotype

Table 1: Typical metadata gathered as a result of single-column data profiling.
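
As one possible illustration, several of the single-column measures in Table 1, plus a simple multi-column dependency check, could be computed with pandas as sketched below (the sample data and column names are hypothetical):

```python
import pandas as pd

# Hypothetical source sample with one numeric column and two related text columns.
src = pd.DataFrame({
    "unit_price": [19.99, 24.50, 19.99, None, 5.00, 19.99, 120.00],
    "zip_code":   ["10001", "10001", "94105", "94105", "10001", "60601", "60601"],
    "city":       ["New York", "New York", "San Francisco", "San Francisco",
                   "New York", "Chicago", "Chicago"],
})

col = src["unit_price"]
single_column = {
    "num_rows":    len(col),
    "null_pct":    col.isna().mean(),                         # null values
    "distinct":    col.nunique(),                             # cardinality
    "uniqueness":  col.nunique() / len(col),                  # distinct values / rows
    "constancy":   col.value_counts().iloc[0] / len(col),     # most frequent value / rows
    "quartiles":   col.quantile([0.25, 0.5, 0.75]).tolist(),  # splits values into four groups
    "first_digit": col.dropna().astype(str).str[0].value_counts(normalize=True).to_dict(),
}
print(single_column)

# Multi-column check: does zip_code functionally determine city (a candidate data rule)?
cities_per_zip = src.groupby("zip_code")["city"].nunique()
print(cities_per_zip[cities_per_zip > 1])   # zip codes mapped to multiple cities need review
```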

Table 2 presents a method for assessing the data quality observed during source data profiling (S. Juddoo, “Overview of Data Quality Challenges in the Context of Big Data”, ResearchGate.net, Dec. 2015).

  1. Accessibility – Extent to which data is available, or easily and quickly retrievable
  2. Suitable volume of data – Extent to which the volume of data is appropriate for the task at hand
  3. Believability – Extent to which data is regarded as true and credible
  4. Completeness – Extent to which data is not missing and is of sufficient breadth and depth for the task at hand; often measured as the number of NULL values in a column, where a NULL may mean that the value does not exist, that it exists but is unknown, or that it is not known whether a value exists
  5. Consistent representation – Extent to which data is presented in the same format
  6. Ease of manipulation – Extent to which data is easy to manipulate and apply to different tasks
  7. Free-of-error – Extent to which data is correct and reliable
  8. Interpretability – Extent to which data is expressed in appropriate languages, symbols, and units, with clear definitions
  9. Objectivity – Extent to which data is unbiased, unprejudiced, and impartial
  10. Relevancy – Extent to which data is applicable and helpful for the task at hand
  11. Reputation – Extent to which data is highly regarded in terms of its source and content
  12. Security – Extent to which access to data is restricted appropriately to maintain its security
  13. Timeliness – Extent to which data is sufficiently up to date
  14. Understandability – Extent to which data is easily comprehended
  15. Value-added – Extent to which data is beneficial and provides advantages from its use

Table 2: A method of measuring data quality for individual data source elements and attributes.
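
Most of the dimensions in Table 2 require business judgment, but a few can be scored directly from the data during profiling. Below is a minimal sketch that scores completeness and consistent representation; the extract file, the birth_date column, and its expected format are assumptions for illustration:

```python
import pandas as pd

# Hypothetical candidate source extract.
src = pd.read_csv("customer_extract.csv")

# Completeness (dimension 4): share of non-null values per column.
completeness = src.notna().mean()
print(completeness.sort_values())

# Consistent representation (dimension 5): share of values matching an expected pattern,
# here an assumed YYYY-MM-DD format for a hypothetical birth_date column.
consistency = (
    src["birth_date"].dropna().astype(str).str.match(r"^\d{4}-\d{2}-\d{2}$").mean()
)
print(f"birth_date format consistency: {consistency:.1%}")
```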

Conclusion

Because data profiling improves data quality, data management, and data governance, it is important for customers, data scientists, and DBAs to adopt data profiling processes and tools. Data is an asset for all users; therefore, its quality should be controlled by procedures, rules, people, and software.

Source data profiling is a crucial but often overlooked step in data preparation: the DW/BI team needs to fully understand source data before further preparing it for downstream consumption. Beyond simple visual examination, you need to profile, visualize, detect outliers, and find null values and other junk data in your data set.

The first purpose of source data profiling and analysis is to decide if candidate data sources are worthy of inclusion in your project. As data warehouse guru Ralph Kimball wrote in his book, The Data Warehouse Toolkit, “Early disqualification of a data source is a responsible step that can earn you respect from the rest of the team, even if it is bad news.”

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

About the author 

Wayne Yaddow

Wayne Yaddow is an independent consultant with more than 20 years’ experience leading data integration, data warehouse, and ETL testing projects with J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, and IBM. He taught IIST (International Institute of Software Testing) courses on data warehouse and ETL testing and wrote DW/BI articles for Better Software, The Data Warehouse Institute (TDWI), Tricentis, and others. Wayne continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at wyaddow@gmail.com.
