Decisions in today’s organizations are increasingly data-driven and made in real time. The business intelligence databases that support those decision makers must therefore be of exceptional quality.
Testing a data warehouse that produces business intelligence (BI) reports is sometimes confused with backend or database testing, or with testing the BI reports themselves. Data warehouse testing is much more complex and diverse: nearly everything in a BI application involves the data that “drives” intelligent decision making.
Data integrity can be compromised during each DW/BI phase: when data is created, integrated, moved, or transformed.
This article highlights strategies and best practices for catching data integrity issues during the project design phase.
Common data quality issues to be discovered during DW/BI design
A first level of testing and validation begins with formal acceptance of the logical data model and the “low-level design” (LLD). All further testing and validation is based on an understanding of each data element in the model.
Data elements that are created through a transformation or aggregation process must be clearly identified, and the calculations for each of these data elements must be clearly documented and easy to interpret.
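For instance, a derived measure might be loaded with SQL like the following, with the calculation rule from the LLD documented alongside the code. This is a minimal sketch; the fact_sales_monthly, stg_sales, and dim_date tables and their columns are illustrative, not taken from any particular project:

    -- Derived element: fact_sales_monthly.net_revenue
    -- Rule (per LLD): net_revenue = SUM(gross_amount) - SUM(discount_amount) - SUM(tax_amount),
    -- aggregated per customer per calendar month
    INSERT INTO fact_sales_monthly (customer_key, month_key, net_revenue)
    SELECT
        s.customer_key,
        d.month_key,
        SUM(s.gross_amount) - SUM(s.discount_amount) - SUM(s.tax_amount) AS net_revenue
    FROM stg_sales s
    JOIN dim_date d ON d.date_key = s.date_key
    GROUP BY s.customer_key, d.month_key;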
During LLD reviews and updates, special consideration should be given to typical data modeling scenarios that occur in the project. For example:
- Verify that many-to-many attribute relationships are clarified and resolved
- Verify the types of keys that are used: surrogate keys, natural keys, ETL-generated keys
- Verify that business analysts and DBAs review, together with the ETL architects and application developers, the lineage and business rules for extracting, transforming, and loading the data warehouse
- Verify that all transformation rules, summarization rules, and matching and consolidation rules have clear specifications
- Confirm that the transformations, business rules, and cleansing described in the LLD and application-logic specifications meet business requirements, and that they have been coded correctly in the ETL, Java, or SQL used for data loads
- Verify that ETL procedures are documented to monitor and control data extraction, transformation, and loading. The procedures should describe how to handle exceptions and program failures
- Verify that consolidation of duplicate or merged records is handled properly (see the consolidation and SCD checks sketched after this list)
- Verify that samples of domain transformations will be checked to confirm that source values are mapped correctly to target values (see the domain and range checks sketched after this list)
- Ensure that primary key values are unique and that primary and foreign key values agree between the source data and the data loaded into the warehouse (see the key and referential-integrity checks sketched after this list)
- Validate that target data types are as specified in the design and/or the data model
- Verify that data field types and formats are specified and implemented
- Verify that default values are specified for fields where needed
- Verify that processing for invalid field values in the source is defined
- Verify that expected ranges of field values are specified
- Verify that all keys generated by the ETL “sequence generator” are identified
- Verify that slowly changing dimensions (SCDs) are identified and their handling described (see the sketch after this list)
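To illustrate the key and referential-integrity checks above, queries along the following lines can be run after a load. This is a minimal sketch: dim_product, stg_product, and fact_sales and their columns are hypothetical, and product_id is assumed to be unique in the dimension:

    -- Duplicate natural keys in the product dimension (expect zero rows)
    SELECT product_id, COUNT(*) AS row_count
    FROM dim_product
    GROUP BY product_id
    HAVING COUNT(*) > 1;

    -- Source products that never reached the warehouse (expect zero rows)
    SELECT s.product_id
    FROM stg_product s
    LEFT JOIN dim_product d ON d.product_id = s.product_id
    WHERE d.product_id IS NULL;

    -- Fact rows whose foreign key has no matching dimension row (expect zero rows)
    SELECT f.sales_id, f.product_key
    FROM fact_sales f
    LEFT JOIN dim_product d ON d.product_key = f.product_key
    WHERE d.product_key IS NULL;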
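Domain and range rules from the LLD translate into similar checks. The sketch below, again with hypothetical tables, columns, and code sets, flags target values outside the specified domain, samples the source-to-target mapping for review against the transformation specification, and looks for measures outside their expected range:

    -- Target category codes outside the domain defined in the LLD (expect zero rows)
    SELECT product_key, category_code
    FROM dim_product
    WHERE category_code NOT IN ('ELEC', 'FURN', 'CLTH', 'OTHR');

    -- Sample of the source-to-target domain mapping, for review against the spec
    SELECT s.category AS source_value, d.category_code AS target_value, COUNT(*) AS row_count
    FROM stg_product s
    JOIN dim_product d ON d.product_id = s.product_id
    GROUP BY s.category, d.category_code;

    -- Measures outside their expected range (expect zero rows)
    SELECT sales_id, quantity, unit_price
    FROM fact_sales
    WHERE quantity <= 0 OR unit_price < 0;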
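Consolidation and slowly changing dimension rules can be verified the same way. Assuming a hypothetical Type 2 dim_customer with customer_id as the natural key, a current_flag column, and effective_from/effective_to dates (current rows carrying a far-future effective_to), the following queries look for unconsolidated duplicates, multiple current rows, and overlapping history:

    -- Potential duplicates that were not consolidated: one email mapped to
    -- more than one current customer (expect zero rows)
    SELECT email, COUNT(DISTINCT customer_id) AS distinct_customers
    FROM dim_customer
    WHERE current_flag = 'Y'
    GROUP BY email
    HAVING COUNT(DISTINCT customer_id) > 1;

    -- Type 2 SCD: more than one current row per natural key (expect zero rows)
    SELECT customer_id, COUNT(*) AS current_rows
    FROM dim_customer
    WHERE current_flag = 'Y'
    GROUP BY customer_id
    HAVING COUNT(*) > 1;

    -- Type 2 SCD: overlapping effective-date ranges for the same natural key (expect zero rows)
    SELECT a.customer_id, a.customer_key AS row_a, b.customer_key AS row_b
    FROM dim_customer a
    JOIN dim_customer b
      ON  a.customer_id = b.customer_id
      AND a.customer_key < b.customer_key
      AND a.effective_from < b.effective_to
      AND b.effective_from < a.effective_to;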
Conclusion
Data warehouse testing is frequently deferred until late in the project life-cycle. If testing is shortchanged (e.g., due to schedule overruns or limited resource availability), there’s a high risk that critical data integrity issues will slip through the verification efforts. Even if thorough testing is performed, it’s difficult and costly to address most data integrity issues exposed by this late-cycle testing.
When testing occurs this late in the DW/BI life-cycle, the cause of an error can be anything from a data quality issue introduced when the data entered the warehouse to a processing issue caused by business-logic failures somewhere along the layers of data warehouse loading and BI reporting.