3 key data integrity testing strategies for DW/ BI

Data warehousing and business intelligence users assume, and need, trustworthy data.
In the Gartner Group’s Online IT Glossary, data integrity and data integrity testing are defined as follows:

Data Integrity: the quality of the data residing in data repositories and database objects. The measurement which users consider when analyzing the value and reliability of the data.

Data Integrity Testing: verification that moved, copied, derived, and converted data is accurate and functions correctly within a single subsystem or application.

Data integrity processes should not only help you understand a project’s data integrity, but also help you gain and maintain the accuracy and consistency of data over its lifecycle. This includes data management best practices such as preventing data from being altered each time it is copied or moved. Processes should be established to maintain DW/ BI data integrity at all times. Data, in its final state, is the driving force behind industry decision making. Errors with data integrity commonly arise from human error, noncompliant operating procedures, errors in data transfers, software defects, compromised hardware, and physical compromise to devices.
This article provides a focus on DW/ BI “data integrity testing” — testing processes that support:

  • All data warehouse sources and target schemas
  • Extract, Transform, Load (ETL) processes
  • Business intelligence components and front-end applications

We will cover how key data integrity testing strategies are addressed in each of the above categories.
Other categories of DW/ BI and ETL testing, though important, are not a focus of this article (e.g., functional, performance, security, scalability, system and integration, and end-to-end testing).

Classifications of Data Integrity for DW/ BI Systems

To build on Gartner’s definition above, data integrity is

an umbrella term that refers to the consistency, accuracy, and correctness of data stored in a database.

There are 3 primary types of data integrity: entity, domain, and referential.

  1. Entity Integrity ensures that each row in a table, for example, is uniquely identified and free of duplication. It is typically enforced by placing primary key and unique constraints on specific columns. Testing may be performed by introducing duplicate or null key values into test data.
  2. Domain Integrity requires that each data value/column falls within a defined, permissible range. Examples include correct data type, format, and length; values within the range defined for the system; null status; and permitted size. Testing may be accomplished, in part, using null, default, and invalid values.
  3. Referential Integrity is concerned with keeping the relationships between tables synchronized. It is often enforced with primary key and foreign key relationships and may be tested, for example, by deleting parent or child rows and checking how the relationship is handled. A minimal sketch of checks for all three types follows this list.
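
To make these three classifications concrete, here is a minimal sketch of the kind of checks a tester might run for each, assuming a hypothetical customer (parent) and orders (child) pair of tables in a SQLite warehouse; the database file, table, and column names are illustrative only.

```python
import sqlite3

# Hypothetical warehouse tables: customer (parent) and orders (child).
conn = sqlite3.connect("warehouse.db")
cur = conn.cursor()

# 1. Entity integrity: no duplicate or NULL primary-key values.
dup_keys = cur.execute(
    "SELECT customer_id, COUNT(*) FROM customer "
    "GROUP BY customer_id HAVING COUNT(*) > 1 OR customer_id IS NULL"
).fetchall()
assert not dup_keys, f"Entity integrity violated: {dup_keys}"

# 2. Domain integrity: values must fall inside the permitted set/range.
bad_status = cur.execute(
    "SELECT order_id, status FROM orders "
    "WHERE status NOT IN ('NEW', 'SHIPPED', 'CANCELLED') OR status IS NULL"
).fetchall()
assert not bad_status, f"Domain integrity violated: {bad_status}"

# 3. Referential integrity: every child row must point at an existing parent.
orphans = cur.execute(
    "SELECT o.order_id FROM orders o "
    "LEFT JOIN customer c ON o.customer_id = c.customer_id "
    "WHERE c.customer_id IS NULL"
).fetchall()
assert not orphans, f"Referential integrity violated: {orphans}"
```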

Verifying Data Integrity in Schemas, ETL Processes, and BI Reports

Before we dive into the 3 key data integrity strategies, let’s quickly outline a common framework (Figure 1) that illustrates the major DW/ BI components generally verified in each phase of DW/ BI testing.

Figure 1: General framework for DW/ BI testing during the software development lifecycle (SDLC)



It is important to keep this framework in mind, because the following 3 key DW/ BI components are presented within it:

1. Verifications of Source and Target Data Requirements and Technical Schema Implementations

Requirements and schema-level tests confirm to what extent the design of each data component matches the targeted business requirements. This process should include the ability to verify:

  • Business and technical requirements for all source and target data
  • Data integrity specifications technically implemented (database management systems, file systems, text files, etc.)
  • Data models for each implemented data schema
  • Source-to-target data mappings vs. the data actually loaded into DW targets. Examples of sources and associated targets include source data loaded to staging targets, as well as staging data loaded to data warehouse or data mart targets. (A sketch of such a source-to-target reconciliation follows this list.)
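
To illustrate the last bullet, the following sketch reconciles a hypothetical staging table against its warehouse target by comparing record counts and a simple column mapping; the database files, table names, and mapping are assumptions made for the example, not a prescribed implementation.

```python
import sqlite3

# Hypothetical source (staging) and target (warehouse) databases.
src = sqlite3.connect("staging.db")
tgt = sqlite3.connect("warehouse.db")

# Source-to-target mapping under test: staging column -> warehouse column.
column_map = {"cust_name": "customer_name", "cust_email": "customer_email"}

# Record counts should match after a full load (no filtering in this example).
src_count = src.execute("SELECT COUNT(*) FROM stg_customer").fetchone()[0]
tgt_count = tgt.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
assert src_count == tgt_count, f"Row counts differ: {src_count} vs {tgt_count}"

# Spot-check that mapped columns carry the same values for matching keys.
for src_col, tgt_col in column_map.items():
    src_rows = dict(src.execute(
        f"SELECT cust_id, {src_col} FROM stg_customer").fetchall())
    tgt_rows = dict(tgt.execute(
        f"SELECT customer_id, {tgt_col} FROM dim_customer").fetchall())
    mismatches = {k: (v, tgt_rows.get(k)) for k, v in src_rows.items()
                  if tgt_rows.get(k) != v}
    assert not mismatches, f"{src_col} -> {tgt_col} mismatches: {mismatches}"
```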

Schema quality represents the ability of a schema to adequately and efficiently represent the information and data it holds. Schema in this definition refers to the schema of the data warehouse, regardless of whether it is a conceptual, logical, or physical schema, or a star, constellation, or normalized design. The definition is extended here to include the schemas of all data stores used in the data warehouse system, including data sources, staging areas, the operational data store, and the data marts. It is beneficial to assess schema quality during the design phase of the data warehouse.

Detecting, analyzing, and correcting schema deficiencies will boost the quality of the DW/ BI system. Schema quality can be viewed along several dimensions, namely (a sketch that automates a few of these checks follows this list):

  • Schema correctness
  • Schema completeness
  • Schema conformity
  • Schema integrity
  • Interpretability
  • Tractability
  • Understandability
  • Concise representation
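
Part of assessing schema correctness, completeness, and conformity can be automated by comparing the implemented schema’s metadata with the documented data model. The sketch below does this for a SQLite target using PRAGMA table_info; the expected-column specification and the dim_customer table name are assumed for illustration.

```python
import sqlite3

# Documented data model for a hypothetical dim_customer table:
# column name -> (declared type, NOT NULL required?, part of primary key?)
expected = {
    "customer_id":    ("INTEGER", True,  True),
    "customer_name":  ("TEXT",    True,  False),
    "customer_email": ("TEXT",    False, False),
}

conn = sqlite3.connect("warehouse.db")
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
actual = {row[1]: (row[2], bool(row[3]), bool(row[5]))
          for row in conn.execute("PRAGMA table_info(dim_customer)")}

missing = expected.keys() - actual.keys()
unexpected = actual.keys() - expected.keys()
mismatched = {c: (expected[c], actual[c])
              for c in expected.keys() & actual.keys()
              if expected[c] != actual[c]}

assert not missing, f"Schema completeness: missing columns {missing}"
assert not unexpected, f"Schema conformity: unexpected columns {unexpected}"
assert not mismatched, f"Schema correctness: mismatches {mismatched}"
```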

2. ETL source and target data integrity tests

Most DW integrity testing and evaluation focus on this process. Various functional and non-functional testing methods are applied to verify the ETL process logic. The goals are to:

  • Verify that valid and invalid conditions are correctly processed for all source and target data
  • Ensure primary and foreign key integrity
  • Verify the correctness of data transformations
  • Ensure data cleansing is applied as specified
  • Guarantee application of business rules, etc.

A properly-designed ETL system extracts data from source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a format that enables application developers to build applications and enables end users to make decisions.
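
As a concrete example of checking transformation correctness after a load, the sketch below recomputes two simple transformation rules from hypothetical staging data and compares the results with what the ETL loaded into the target; the table names and rules are assumptions for illustration, not the method of any particular ETL tool.

```python
import sqlite3

src = sqlite3.connect("staging.db")
tgt = sqlite3.connect("warehouse.db")

# Rule 1 (assumed): customer_name = TRIM(first_name) || ' ' || TRIM(last_name)
expected_names = dict(src.execute(
    "SELECT cust_id, TRIM(first_name) || ' ' || TRIM(last_name) "
    "FROM stg_customer").fetchall())
loaded_names = dict(tgt.execute(
    "SELECT customer_id, customer_name FROM dim_customer").fetchall())
name_diffs = {k: (v, loaded_names.get(k))
              for k, v in expected_names.items() if loaded_names.get(k) != v}
assert not name_diffs, f"Name transformation incorrect: {name_diffs}"

# Rule 2 (assumed): fact order_total = SUM of staging line amounts per order.
expected_totals = dict(src.execute(
    "SELECT order_id, ROUND(SUM(line_amount), 2) "
    "FROM stg_order_line GROUP BY order_id").fetchall())
loaded_totals = dict(tgt.execute(
    "SELECT order_id, ROUND(order_total, 2) FROM fact_orders").fetchall())
total_diffs = {k: (v, loaded_totals.get(k))
               for k, v in expected_totals.items() if loaded_totals.get(k) != v}
assert not total_diffs, f"Aggregation incorrect: {total_diffs}"
```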

3. BI reporting verifications

BI applications provide an interface that helps users interact with the back-end. The design of these reports is critical for understanding and planning the data integrity tests.
Insights such as what content uses which information maps, what ranges are leveraged in which indicators, and where interactions exist between indicators are required to build a full suite of test cases. If any measures are defined in the report itself, these should be verified as accurate. All other data elements that are pulled straight from the underlying tables or information maps should already have been validated in one of the two preceding sections.
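
One way to verify report-defined measures is to recompute them directly from the warehouse and compare them against the values the BI tool produces. The sketch below assumes the report’s result set has been exported to a CSV file; the file name, column names, and the revenue-by-region measure definition are all illustrative assumptions.

```python
import csv
import sqlite3

# Recompute the report measure (revenue per region) straight from the warehouse.
conn = sqlite3.connect("warehouse.db")
warehouse_totals = dict(conn.execute(
    "SELECT region, ROUND(SUM(order_total), 2) FROM fact_orders "
    "JOIN dim_customer USING (customer_id) GROUP BY region").fetchall())

# Load the same measure as exported from the BI report (hypothetical CSV export).
report_totals = {}
with open("revenue_by_region_report.csv", newline="") as f:
    for row in csv.DictReader(f):
        report_totals[row["region"]] = round(float(row["revenue"]), 2)

# Every region should reconcile; a small tolerance could be used instead of equality.
diffs = {r: (warehouse_totals.get(r), report_totals.get(r))
         for r in warehouse_totals.keys() | report_totals.keys()
         if warehouse_totals.get(r) != report_totals.get(r)}
assert not diffs, f"Report does not reconcile with warehouse: {diffs}"
```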


A DW/ BI verification framework and sample verifications

DW/ BI data integrity verification is categorized here as follows. Figure 2 shows a verification classification framework for the techniques applicable to sources and targets in data warehouse, ETL process, and BI report applications.

Figure 2: Framework for DW/ BI data integrity verifications

The “what”, “when”, and “where” of DW/ BI data integrity testing are represented in the following table.

  • Column headings represent when and where data related testing will take place
  • Rows represent “what” data-related items should be considered for testing

A Sampling of Verifications in the Three Categories of Data Integrity Testing: Schemas, ETL Processes, and BI Reports:

Verifications of Source & Target Data Requirements and Technical Schema Implementations

•Data aggregation rules
•Data boundaries correct
•Data filtering, source to target correct
•Data formats correct
•Data lengths correct
•Data transformation rules correct and understandable
•Data types correct
•Date/time formats correct
•Default values defined
•Domain ranges defined
•Field data boundaries defined
•Field data constraints correct
•Field names correct
•Indices defined
•Null fields correct
•Numeric field precisions correct
•Primary & foreign keys assigned
•Surrogate keys identified

Source & Target Data Integrity Tests After ETLs

•“Lookups” work as expected
•All fields loaded as expected
•Concatenated data from multiple fields correct
•Correct handling of “change data capture” (CDC)
•Correct handling of “slowly changing dimensions” (SCDs)
•Data Inserts, updates, deletes as expected
•Data profiling on source and target data – no anomalies
•Data sorted as defined
•Data transformations, cleaning, enrichment as expected
•Date/time format and values correct
•Default values correct
•Domain integrity maintained
•Duplicate data checks as expected
•ETL errors/anomalies logged
•Field data aggregations correct
•Field data constraints applied
•No field data truncations
•No negative values where positive expected
•No null field data when defined not null
•Numeric field precisions correct
•Parent to child relationships checked
•Referential integrity as expected
•Rejected records handled as expected
•Source to target field data copied with no changes
•Source to target record counts as expected
•Sources to targets data filtered correctly
•Trim functions are correct

BI Reporting Verifications

•Data aggregation rules applied
•Data boundaries correct
•Data filtering correct
•Data formats correct
•Data lengths correct
•Data transformation rules applied
•Data types correct
•Data value sorting correct
•Date/time formats correct
•Default values correct
•Derived data correct
•Domain ranges as expected
•Drill up and downs display correct data
•Exported data correct
•Field data boundaries defined
•Field data constraints correct
•Field data traceable back to DW
•Field names correct
•Field totals correct
•Field values for aggregates correct
•Filtered data fields correct (ranges, IDs, etc.)
•Graphed data correct
•Min, max, avg values correct
•Null fields correct
•Numeric field precisions correct
•Numeric precision for all fields
•Report data match DW/ data mart
•Report field default values correct
•Report formats comply with requirements
•Summary fields correct
•Validate the access to data – security

Key Takeaways

  • Data in its final state is the driving force behind organizational decision making.
  • Raw data is often changed and processed to reach a usable format for BI reports. Data integrity practices ensure that this DW/ BI information is attributable and accurate.
  • Data can easily become compromised if proper measures are not taken to verify it as it moves between environments on its way to DW/ BI projects. Errors with data integrity commonly arise from human error, noncompliant operating procedures, errors during data transfers, software defects, and compromised hardware.
  • By applying the 3 key data integrity testing strategies introduced in this article, you should be able to improve quality and reduce time and costs when developing and maintaining a DW/ BI project.

About the author

Wayne Yaddow

Wayne Yaddow is an independent consultant with more than 20 years’ experience leading data integration, data warehouse, and ETL testing projects with J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, and IBM. He taught IIST (International Institute of Software Testing) courses on data warehouse and ETL testing and wrote DW/BI articles for Better Software, The Data Warehouse Institute (TDWI), Tricentis, and others. Wayne continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at wyaddow@gmail.com.
