Today’s Data Warehouse (DW) source data often originates from a wide variety of data formats, as illustrated in Figure 1. Original sources may include external data, reference data, business transactions, production raw data, and more. Each of these data sources has its own methods, systems, and architectures for storing data.
When working with source data, DW development teams often face new or unknown datasets that need to be processed to complete a specific DW or BI requirement. Due to the unknown characteristics of the data and its schema, it may be unclear how ETL processing should be carried out, making it necessary to inspect and analyze the data first.
In common DW/BI technical literature, this inspection is usually referred to as data profiling, defined as “the set of activities and processes to determine the metadata about a given dataset”. Examples of such metadata include aggregated counts, correlations, value distributions, and functional dependencies. Despite its value, we have frequently observed that practitioners rarely perform comprehensive source data profiling in real-world scenarios.
Rather, it seems to be common practice to inspect an unknown dataset manually, by opening it in, e.g., a spreadsheet tool and simply skimming through it. This practice is not only time- and cost-intensive, but also highly dependent on individual skills, prone to errors and inconsistencies, and it does not scale to larger datasets.
Another reason practitioners neglect data profiling is a lack of knowledge. Many people are not aware of the process and how to proceed. Additionally, there seems to be little understanding of the potential benefits, which leads to a reluctance to learn.
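Even a small script can replace manual skimming with repeatable profiling output of the kind described above (counts, null rates, value distributions). The following is a minimal, illustrative sketch using only the Python standard library; the field names and sample records are hypothetical.

```python
# A minimal sketch of automated column profiling using only the Python
# standard library. The sample records and field names are hypothetical.
from collections import Counter

rows = [
    {"customer_id": "1001", "country": "DE", "amount": "19.99"},
    {"customer_id": "1002", "country": "de", "amount": ""},
    {"customer_id": "1002", "country": "US", "amount": "5.00"},
]

def profile_column(rows, column):
    """Compute simple profiling metadata: counts, nulls, distinct values."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

for col in ("customer_id", "country", "amount"):
    print(col, profile_column(rows, col))
```

Even on this tiny sample, the output surfaces a missing amount, a duplicated customer ID, and an inconsistent country code ("DE" vs. "de") that visual skimming could easily miss.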
Industry pundits agree: there is a clear need for effective source data discovery
A September 24, 2019 article posted on SDTimes.com states that “A number of factors can compromise the health of data, making that data unmanageable and unusable for today’s businesses. Specifically, data professionals face a dramatic increase in data complexity, variety and scale.” Two of those factors are described here:
Data sprawl – The dramatic variety and volume of data sources used by DW projects.
Data drift – Unpredictable, unannounced, and unending transformations of data characteristics caused by the operation, maintenance and modernization of the systems that produce and enrich the data. Records that those systems create change constantly as the owners adopt updates or re-platform their systems. Modern enterprises experience new data constantly in different formats, from various technologies and new locations.
Data sprawl and data drift therefore create a need for extensive data preparation before ingestion into a data warehouse, and data profiling of all sources is a primary means of data preparation. Other data preparation tasks include data cleansing, planning for data joins/merges, aggregation, and data deduplication.
William Laurent wrote a piece entitled, “Best Practices for Data Warehouse Database Developers”. The number one best practice was “make sure you are provided with a source data discovery and associated data dictionary process before starting ETL design and development. Data discovery can help build that data dictionary and design without relying on assumptions and assertions made by business analysts and database administrators.”
Common data quality issues in candidate data sources
Profiling “candidate” sources intended for DW integration is critical to successful development. Every data integration project discovers something in potential sources that significantly affects the design of the solution: unexpected content, missing content, poor-quality data, or even the discovery that the required data doesn’t exist where it was expected. Negotiation with data owners and data security groups should continue until acceptable sources are found that allow data profiling on the proposed source and target data.
Data profiling can reveal many data quality issues. Some common issues include:
- Varying timeliness of data sources – many sources contain the timeframes needed, others do not
- Presence of duplicate data
- Presence of conflicting data among multiple sources
- Missing data in fields defined as “not null”
- Source fields defined to allow free form (no restrictions) on data entry
- Unrealized data relationships among data sources
- Multi-purpose fields present in data sources
- Differing business rules applied across various data sources
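Two of the issues above, duplicate records and conflicting data among multiple sources, can be detected with simple set logic before any heavyweight tooling is involved. The following is a hypothetical sketch; the source names and sample values are invented for illustration.

```python
# Hypothetical sketch: detecting duplicate keys within one source and
# conflicting values for the same key across two sources.
from collections import defaultdict

# Two sources that should agree on customer names, but do not for 1002.
source_a = {"1001": "Alice", "1002": "Bob"}
source_b = {"1001": "Alice", "1002": "Robert"}

def find_conflicts(a, b):
    """Return keys present in both sources whose values disagree."""
    return {k for k in a.keys() & b.keys() if a[k] != b[k]}

def find_duplicates(records, key):
    """Return key values that appear more than once in a record set."""
    seen = defaultdict(int)
    for r in records:
        seen[r[key]] += 1
    return {k for k, n in seen.items() if n > 1}

records = [{"id": "1"}, {"id": "2"}, {"id": "1"}]
print(find_conflicts(source_a, source_b))  # {'1002'}
print(find_duplicates(records, "id"))      # {'1'}
```

In practice the same checks would run over extracted tables rather than in-memory dictionaries, but the logic, intersect on keys and compare values, is the same.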
Additional examples of source data quality issues may be found in this ResearchGate.net paper: R. Singh, K. Singh, “A Descriptive Classification for Causes of Data Quality Problems in Data Warehousing”, ResearchGate.net, May 2010.
The challenges of data profiling to support effective data discovery
In a 2019 Enterprise Data Quality Survey of 175 large businesses, Syncsort tallied results of investigations into the challenges to data quality (Figure 2). Note that the number of sources, volume of data, and variety of data formats are major challenges to data quality.
As soon as candidate data sources have been identified, a data profiling assessment should be considered for each one to support a “yes” or “no” decision regarding utilization of the data source.
The following are known challenges to a successful data source quality assessment process.
- Data completeness: The complexity of a DW often increases geometrically with the span of time for data to be loaded. As the length of time covered by each source increases (e.g., number of months/years), the likelihood of data quality issues increases, often the result of data changing over time.
- Pattern analysis for important fields: Pattern analysis (i.e., determining which formats exist for a given attribute) may become difficult due to the large size of the candidate data store (i.e., profiling tools may be inadequate to the task).
- Analysis of counts and statistical functions: Profiling results may demonstrate that min/max values, counts, sums, modes, percentiles, means, or standard deviations do not meet project requirements.
- Column profiling: Single-table structural profiling and cross-table structural profiling of the data sources cannot always be accomplished with available tools across many or all data formats.
- Column and table relationship profiling: Table structural profiling, which seeks relationships among columns and/or tables, may uncover issues with primary keys and data structure quality. Cross-table structural profiling compares data between tables.
- Inadequate documentation of source changes over time: Documentation written by those who don’t understand the domain thoroughly may be vague and incomplete.
- Profiling tools with inadequate capabilities for dependencies, formats, and values of source data: Profiling by data format, inter-dependencies between attributes, and their values may not be possible due to a lack of appropriate tools.
- Data profiling conducted manually (ex., SQL queries) is likely to be incomplete and resource intensive: Traditional data profiling requires skilled technical resources who can manually plan and query using SQL.
- Source data “change” history and business rules “change” history may not be available
- Data dictionaries representing source data under investigation may be incomplete or not available
- Historically inadequate testing of individual data sources may lead to poor data quality
- Lack of business data ownership, policy, and planning of the entire enterprise data contributes to data quality problems
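Two of the challenges above, pattern analysis and column/table relationship profiling, can be approximated even without a dedicated profiling tool. The sketch below is illustrative only: the value-shape convention (digits become `9`, letters become `A`) and the sample customer/order tables are assumptions, not a prescribed method.

```python
# Hypothetical sketch of two profiling checks from the list above:
# (1) pattern analysis: reduce each value to its format and tally formats;
# (2) relationship profiling: test key uniqueness and an inclusion dependency.
from collections import Counter
import re

def value_pattern(value):
    """Reduce a value to its shape, e.g. '12-AB' -> '99-AA'."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", value))

def is_unique_key(rows, key):
    """True when the column is a viable primary key (no duplicate values)."""
    values = [r[key] for r in rows]
    return len(values) == len(set(values))

def orphan_foreign_keys(child, fk, parent, pk):
    """Child rows whose foreign key has no match in the parent table."""
    parent_keys = {r[pk] for r in parent}
    return [r for r in child if r[fk] not in parent_keys]

# Pattern analysis on a phone-number column with mixed formats.
phones = ["030-1234", "040-5678", "+49 30 99", "030-9999"]
print(Counter(value_pattern(v) for v in phones).most_common())

# Cross-table check: order 11 references a customer that does not exist.
customers = [{"cust_id": 1}, {"cust_id": 2}, {"cust_id": 3}]
orders = [{"order_id": 10, "cust_id": 1}, {"order_id": 11, "cust_id": 4}]
print(is_unique_key(customers, "cust_id"))
print(orphan_foreign_keys(orders, "cust_id", customers, "cust_id"))
```

The pattern tally immediately shows that one phone number deviates from the dominant format, and the orphan check flags a broken relationship between the tables, exactly the kinds of findings that should feed the go/no-go assessment of a candidate source.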
Best practices to achieve optimal source data profiling
The following are a few of the practices that help ensure optimal source data profiling for your AI and BI projects. Many more can be found from data preparation service and product suppliers. Also, don’t forget to check out “How to implement data profiling for successful source data discovery”.
- Be prepared to sample each data source: Data profiling is resource intensive and profile execution times can be quite long, especially on large data volumes. Profiling the entire contents of one or all of your source systems is not likely to be the most efficient way to produce useful profiling results. An iterative approach, in which initially limited data profiling produces discoveries that are used in deciding what to profile next, is more likely to be effective when faced with a large and complex set of sources to profile.
- Profile the most important data sources that are likely to contain erroneous data: For example, customer and order data that are entered manually are more likely to contain errors than product data downloaded from suppliers.
- Assure that a data dictionary is in the plan for each source: Before profiling data objects, ensure that your project includes the correct and required metadata (e.g., data dictionaries, data catalogs) for the source objects on which you are performing data profiling.
- Start with a prioritized selection of candidate sources: It’s easy to jump into prepping data without thinking about where the data comes from and the reliability of the source. However, for cases where you’ll have to load data repeatedly, the quality, accessibility, and format of the data source can have a big impact on your analytics.
- Extract data to an environment with tools: Once you’ve identified a reliable data source, you need to pull this data into an environment where it can be analyzed and manipulated. Smaller or sample data files with a relatively good native structure can be opened with text editors or spreadsheets. Larger and/or more complicated data sets will require more powerful profiling tools, the likes of which are included with many ETL tools, high-end statistical software, or enterprise-class BI packages.
- Assess data formats: Your analysis always starts with a raw data file. Raw data files come in many different shapes and sizes. Mainframe data is different from PC data, and spreadsheet data is formatted differently than web data. And in the age of big data, you will surely be faced with data from a variety of sources. Your first step in analyzing your data is making sure you can read the files you’re given.
- Verify data accuracy: Once you’re comfortable that the data is formatted the way you want it, you still need to make sure it’s accurate and that it makes sense. This step requires that you have some knowledge of the subject area you are working in.
- Identify outliers: Outliers are data points that are out of whack with the rest of the data. They are either very large or very small values compared with the rest of the dataset.
- Deal with missing values and duplicates: Missing values are one of the most common (and annoying) data problems you will encounter. Your first impulse might be to drop records with missing values from your analysis. The problem with this is that missing values are frequently not just random little data glitches.
- Comprehensive data profiling: You actually need to look at what each field contains. For example, it’s not wise to trust that just because a field is listed as a character field, it actually contains character data. Data sourcing is generally broken down into three steps:
- Define the data needed for a given business task
- Identify potential sources of that data, along with its business and IT owner(s)
- Confirm that the data will be sufficiently available with the history and frequency required by the business task
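Three of the practices above, sampling large sources, identifying outliers, and verifying what a field actually contains, can be sketched briefly in code. The functions, thresholds (a fixed seed, the 1.5 × IQR rule), and sample values below are illustrative assumptions, not a prescribed toolset.

```python
# Hypothetical sketches of three best practices from the list above.
import random
import statistics

def sample_rows(rows, n, seed=42):
    """Profile a reproducible random sample first; escalate if issues appear."""
    rng = random.Random(seed)  # fixed seed so reruns see the same sample
    return list(rows) if len(rows) <= n else rng.sample(rows, n)

def iqr_outliers(values):
    """Flag values beyond 1.5 * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [v for v in values if v < low or v > high]

def fraction_numeric(values):
    """Share of non-empty values in a 'character' field that parse as numbers."""
    non_empty = [v for v in values if v.strip()]
    numeric = sum(1 for v in non_empty if _is_number(v))
    return numeric / len(non_empty) if non_empty else 0.0

def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

print(len(sample_rows(range(1_000_000), 10_000)))   # profile 10,000 rows first
print(iqr_outliers([12, 14, 13, 15, 14, 13, 250]))  # 250 stands out
print(fraction_numeric(["10115", "20095", "N/A", "80331"]))
```

A numeric fraction near but below 1.0, as in the last call, is a typical profiling finding: a nominally numeric field polluted with sentinel strings like "N/A" that would break a downstream load.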
As data profiling leads to the improvement of data quality, data management, and data governance, it is important for customers, data scientists, and DBAs to use data profiling processes and tools. Data is an asset for all users; therefore, data quality should be controlled by procedures, rules, people, and software.
This is the crucial but often overlooked step in data preparation: The DW/BI team needs to fully understand source data before further preparing it for downstream consumption. Beyond simple visual examination, you need to profile, visualize, detect outliers, and find null values and other junk data in your data set.
The first purpose of source data profiling and analysis is to decide if candidate data sources are worthy of inclusion in your project. As data warehouse guru Ralph Kimball wrote in his book, The Data Warehouse Toolkit, “Early disqualification of a data source is a responsible step that can earn you respect from the rest of the team, even if it is bad news.”