Today’s Data Warehouse (DW) source data often originates from a wide variety of data formats, as illustrated in Figure 1. Original sources may include external data, reference data, business transactions, production raw data, and more. Each of these data sources has its own methods, systems, and architectures for storing data.
When working with source data, DW development teams often face new or unknown datasets that need to be processed to complete a specific DW or BI requirement. Due to the unknown characteristics of the data and its schema, it may be unclear how ETL processing should be carried out, making it necessary to inspect and analyze the data first.
In common DW/BI technical literature, this inspection is usually referred to as data profiling, defined as “the set of activities and processes to determine the metadata about a given dataset”. Examples of such metadata include aggregated counts, correlations, value distributions, and functional dependencies. Despite its value, we have frequently observed that practitioners rarely perform comprehensive source data profiling in real-world scenarios.
Rather, it seems to be common practice to inspect an unknown dataset manually, by opening it in, e.g., a spreadsheet tool and simply skimming through it. This practice is not only time- and cost-intensive, but also highly dependent on individual skills, prone to errors and inconsistencies, and it does not scale to larger datasets.
Another reason practitioners neglect data profiling is a lack of knowledge. Many people are not aware of the process and how to proceed. Additionally, there seems to be little understanding of the potential benefits, which leads to a reluctance to learn.
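Even a small script can replace manual skimming with repeatable profiling output of the kind described above (counts, null rates, value distributions). The following is a minimal, illustrative sketch using only the Python standard library; the field names and sample records are hypothetical.

```python
# A minimal sketch of automated column profiling using only the Python
# standard library. The sample records and field names are hypothetical.
from collections import Counter

rows = [
    {"customer_id": "1001", "country": "DE", "amount": "19.99"},
    {"customer_id": "1002", "country": "de", "amount": ""},
    {"customer_id": "1002", "country": "US", "amount": "5.00"},
]

def profile_column(rows, column):
    """Compute simple profiling metadata: counts, nulls, distinct values."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

for col in ("customer_id", "country", "amount"):
    print(col, profile_column(rows, col))
```

Even on this tiny sample, the output surfaces a missing amount, a duplicated customer ID, and an inconsistent country code ("DE" vs. "de") that visual skimming could easily miss.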
Industry pundits agree: there is a clear need for effective source data discovery
A September 24, 2019 article posted on SDTimes.com states that “A number of factors can compromise the health of data, making that data unmanageable and unusable for today’s businesses. Specifically, data professionals face a dramatic increase in data complexity, variety and scale.” Two of those factors are described here:
Data sprawl – The dramatic variety and volume of data sources used by DW projects.
Data drift – Unpredictable, unannounced, and unending transformations of data characteristics caused by the operation, maintenance and modernization of the systems that produce and enrich the data. Records that those systems create change constantly as the owners adopt updates or re-platform their systems. Modern enterprises experience new data constantly in different formats, from various technologies and new locations.
Data sprawl and data drift therefore create a need for extensive data preparation before ingestion into a data warehouse, and data profiling of all sources is a primary means of data preparation. Other data preparation tasks include data cleansing, planning for data joins/merges, aggregation, and data deduplication.
William Laurent wrote a piece entitled, “Best Practices for Data Warehouse Database Developers”. The number one best practice was “make sure you are provided with a source data discovery and associated data dictionary process before starting ETL design and development. Data discovery can help build that data dictionary and design without relying on assumptions and assertions made by business analysts and database administrators.”
Common data quality issues in candidate data sources
Profiling “candidate” sources intended for DW integration is critical to successful development. Every data integration project discovers something in potential sources that significantly affects the design of the solution: unexpected content, missing content, poor-quality data, or even the discovery that the required data doesn’t exist where it was expected. Negotiation with data owners and data security groups should continue until acceptable sources are found that allow data profiling on the proposed source and target data.
Data profiling can reveal many data quality issues. Some common issues include:
- Varying timeliness of data sources – many sources contain the timeframes needed, others do not
- Presence of duplicate data
- Presence of conflicting data among multiple sources
- Missing data in fields defined as “not null”
- Source fields defined to allow free form (no restrictions) on data entry
- Unrealized data relationships among data sources
- Multi-purpose fields present in data sources
- Differing business rules applied across various data sources
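Two of the issues above, duplicate records and conflicting data among multiple sources, can be detected with simple set logic before any heavyweight tooling is involved. The following is a hypothetical sketch; the source names and sample values are invented for illustration.

```python
# Hypothetical sketch: detecting duplicate keys within one source and
# conflicting values for the same key across two sources.
from collections import defaultdict

# Two sources that should agree on customer names, but do not for 1002.
source_a = {"1001": "Alice", "1002": "Bob"}
source_b = {"1001": "Alice", "1002": "Robert"}

def find_conflicts(a, b):
    """Return keys present in both sources whose values disagree."""
    return {k for k in a.keys() & b.keys() if a[k] != b[k]}

def find_duplicates(records, key):
    """Return key values that appear more than once in a record set."""
    seen = defaultdict(int)
    for r in records:
        seen[r[key]] += 1
    return {k for k, n in seen.items() if n > 1}

records = [{"id": "1"}, {"id": "2"}, {"id": "1"}]
print(find_conflicts(source_a, source_b))  # {'1002'}
print(find_duplicates(records, "id"))      # {'1'}
```

In practice the same checks would run over extracted tables rather than in-memory dictionaries, but the logic, intersect on keys and compare values, is the same.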
Additional examples of source data quality issues may be found in this ResearchGate.net paper: R. Singh, K. Singh, “A Descriptive Classification for Causes of Data Quality Problems in Data Warehousing”, ResearchGate.net, May 2010.
The challenges of data profiling to support effective data discovery
In a 2019 Enterprise Data Quality Survey of 175 large businesses, Syncsort tallied results of investigations into the challenges to data quality (Figure 2). Note that the number of sources, volume of data, and variety of data formats are major challenges to data quality.
As soon as candidate data sources have been identified, a data profiling assessment should be considered for each one to support a “yes” or “no” decision regarding utilization of the data source.
The following are known challenges to a successful data source quality assessment process.
- Data completeness: The complexity of a DW often increases geometrically with the span of time for data to be loaded. As the length of time covered by each source increases (e.g., number of months/years), the likelihood of data quality issues increases, often the result of data changing over time.
- Pattern analysis for important fields: Pattern analysis (i.e., determining which formats exist for a given attribute) may become difficult due to the large size of the candidate data store (i.e., profiling tools may be inadequate to the task).
- Analysis of counts and statistical functions: Profiling results may demonstrate that min/max values, counts, sums, modes, percentiles, means, or standard deviations do not meet project requirements.
- Column profiling: Single-table structural profiling and cross-table structural profiling of the data sources cannot always be accomplished with available tools across many or all data formats.
- Column and table relationship profiling: Table structural profiling, which seeks relationships among columns and/or tables, may uncover issues with primary keys and data structure quality. Cross-table structural profiling compares data between tables.
- Inadequate documentation of source changes over time: Documentation written by those who don’t understand the domain thoroughly may be vague and incomplete.
- Profiling tools with inadequate capabilities for dependencies, formats, and values of source data: Profiling by data format, inter-dependencies between attributes, and their values may not be possible due to a lack of appropriate tools.
- Data profiling conducted manually (ex., SQL queries) is likely to be incomplete and resource intensive: Traditional data profiling requires skilled technical resources who can manually plan and query using SQL.
- Source data “change” history and business rules “change” history may not be available
- Data dictionaries representing source data under investigation may be incomplete or not available
- Historically inadequate testing of individual data sources may lead to poor data quality
- Lack of business data ownership, policy, and planning of the entire enterprise data contributes to data quality problems
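Two of the challenges above, pattern analysis and column/table relationship profiling, can be approximated even without a dedicated profiling tool. The sketch below is illustrative only: the value-shape convention (digits become `9`, letters become `A`) and the sample customer/order tables are assumptions, not a prescribed method.

```python
# Hypothetical sketch of two profiling checks from the list above:
# (1) pattern analysis: reduce each value to its format and tally formats;
# (2) relationship profiling: test key uniqueness and an inclusion dependency.
from collections import Counter
import re

def value_pattern(value):
    """Reduce a value to its shape, e.g. '12-AB' -> '99-AA'."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", value))

def is_unique_key(rows, key):
    """True when the column is a viable primary key (no duplicate values)."""
    values = [r[key] for r in rows]
    return len(values) == len(set(values))

def orphan_foreign_keys(child, fk, parent, pk):
    """Child rows whose foreign key has no match in the parent table."""
    parent_keys = {r[pk] for r in parent}
    return [r for r in child if r[fk] not in parent_keys]

# Pattern analysis on a phone-number column with mixed formats.
phones = ["030-1234", "040-5678", "+49 30 99", "030-9999"]
print(Counter(value_pattern(v) for v in phones).most_common())

# Cross-table check: order 11 references a customer that does not exist.
customers = [{"cust_id": 1}, {"cust_id": 2}, {"cust_id": 3}]
orders = [{"order_id": 10, "cust_id": 1}, {"order_id": 11, "cust_id": 4}]
print(is_unique_key(customers, "cust_id"))
print(orphan_foreign_keys(orders, "cust_id", customers, "cust_id"))
```

The pattern tally immediately shows that one phone number deviates from the dominant format, and the orphan check flags a broken relationship between the tables, exactly the kinds of findings that should feed the go/no-go assessment of a candidate source.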
Best practices to achieve optimal source data profiling
The following are a few of the practices that help ensure optimal source data profiling for your AI and BI projects. Many more can be found from data preparation service and product suppliers. Also, don’t forget to check out “How to implement data profiling for successful source data discovery”.
- Be prepared to sample each data source: Data profiling is resource intensive and profile execution times can be quite long, especially on large data volumes. Profiling the entire contents of one or all of your source systems is not likely to be the most efficient way to produce useful profiling results. An iterative approach, in which initially limited data profiling produces discoveries that are used in deciding what to profile next, is more likely to be effective when faced with a large and complex set of sources to profile.
- Profile the most important data sources that are likely to contain erroneous data: For example, customer and order data that are entered manually are more likely to contain errors than product data downloaded from suppliers.
- Assure that a data dictionary is in the plan for each source: Before profiling data objects, ensure that your project includes the correct and required metadata (e.g., data dictionaries, data catalogs) for the source objects on which you are performing data profiling.
- Start with a prioritized selection of candidate sources: It’s easy to jump into prepping data without thinking about where the data comes from and the reliability of the source. However, for cases where you’ll have to load data repeatedly, the quality, accessibility, and format of the data source can have a big impact on your analytics.
- Extract data to an environment with tools: Once you’ve identified a reliable data source, you need to pull this data into an environment where it can be analyzed and manipulated. Smaller or sample data files with a relatively good native structure can be opened with text editors or spreadsheets. Larger and/or more complicated data sets will require more powerful profiling tools, the likes of which are included with many ETL tools, high-end statistical software, or enterprise-class BI packages.
- Assess data formats: Your analysis always starts with a raw data file. Raw data files come in many different shapes and sizes. Mainframe data is different from PC data, and spreadsheet data is formatted differently than web data. And in the age of big data, you will surely be faced with data from a variety of sources. Your first step in analyzing your data is making sure you can read the files you’re given.
- Verify data accuracy: Once you’re comfortable that the data is formatted the way you want it, you still need to make sure it’s accurate and that it makes sense. This step requires that you have some knowledge of the subject area you are working in.
- Identify outliers: Outliers are data points that are out of whack with the rest of the data. They are either very large or very small values compared with the rest of the dataset.
- Deal with missing values and duplicates: Missing values are one of the most common (and annoying) data problems you will encounter. Your first impulse might be to drop records with missing values from your analysis. The problem with this is that missing values are frequently not just random little data glitches.
- Comprehensive data profiling: You actually need to look at what each field contains. For example, it’s not wise to trust that just because a field is listed as a character field, it actually contains character data. Data sourcing is generally broken down into three steps:
- Define the data needed for a given business task
- Identify potential sources of that data, along with its business and IT owner(s)
- Confirm that the data will be sufficiently available with the history and frequency required by the business task
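Three of the practices above, sampling large sources, identifying outliers, and verifying what a field actually contains, can be sketched briefly in code. The functions, thresholds (a fixed seed, the 1.5 × IQR rule), and sample values below are illustrative assumptions, not a prescribed toolset.

```python
# Hypothetical sketches of three best practices from the list above.
import random
import statistics

def sample_rows(rows, n, seed=42):
    """Profile a reproducible random sample first; escalate if issues appear."""
    rng = random.Random(seed)  # fixed seed so reruns see the same sample
    return list(rows) if len(rows) <= n else rng.sample(rows, n)

def iqr_outliers(values):
    """Flag values beyond 1.5 * IQR from the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [v for v in values if v < low or v > high]

def fraction_numeric(values):
    """Share of non-empty values in a 'character' field that parse as numbers."""
    non_empty = [v for v in values if v.strip()]
    numeric = sum(1 for v in non_empty if _is_number(v))
    return numeric / len(non_empty) if non_empty else 0.0

def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

print(len(sample_rows(range(1_000_000), 10_000)))   # profile 10,000 rows first
print(iqr_outliers([12, 14, 13, 15, 14, 13, 250]))  # 250 stands out
print(fraction_numeric(["10115", "20095", "N/A", "80331"]))
```

A numeric fraction near but below 1.0, as in the last call, is a typical profiling finding: a nominally numeric field polluted with sentinel strings like "N/A" that would break a downstream load.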
As data profiling leads to the improvement of data quality, data management, and data governance, it is important for customers, data scientists, and DBAs to use data profiling processes and tools. Data is an asset for all users; therefore, data quality should be controlled by procedures, rules, people, and software.
This is the crucial but often overlooked step in data preparation: The DW/BI team needs to fully understand source data before further preparing it for downstream consumption. Beyond simple visual examination, you need to profile, visualize, detect outliers, and find null values and other junk data in your data set.
The first purpose of source data profiling and analysis is to decide if candidate data sources are worthy of inclusion in your project. As data warehouse guru Ralph Kimball wrote in his book, The Data Warehouse Toolkit, “Early disqualification of a data source is a responsible step that can earn you respect from the rest of the team, even if it is bad news.”