Decisions in today’s organizations are increasingly data-driven and made in real time. The business intelligence databases that support those decision makers must therefore be of exceptional quality.
Testing a data warehouse that produces business intelligence (BI) reports is sometimes confused with backend or database testing, or with testing the BI reports themselves. Data warehouse testing is much more complex and diverse: nearly everything in a BI application involves the data that “drives” intelligent decision making.
Data integrity can be compromised during each DW/BI phase: when data is created, integrated, moved, or transformed.
This article highlights strategies and best practices for catching data integrity issues during the project design phase.
Common data quality issues to be discovered during DW/BI design
A first level of testing and validation begins with formal acceptance of the logical data model and the “low-level design” (LLD). All further testing and validation is based on an understanding of each data element in the model.
Data elements that are created through a transformation or aggregation process must be clearly identified, and the calculations for each of these data elements must be clearly documented and easy to interpret.
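For instance, a derived measure might be loaded with SQL like the following, with the calculation rule from the LLD documented alongside the code. This is a minimal sketch; the fact_sales_monthly, stg_sales, and dim_date tables and their columns are illustrative, not taken from any particular project:

    -- Derived element: fact_sales_monthly.net_revenue
    -- Rule (per LLD): net_revenue = SUM(gross_amount) - SUM(discount_amount) - SUM(tax_amount),
    -- aggregated per customer per calendar month
    INSERT INTO fact_sales_monthly (customer_key, month_key, net_revenue)
    SELECT
        s.customer_key,
        d.month_key,
        SUM(s.gross_amount) - SUM(s.discount_amount) - SUM(s.tax_amount) AS net_revenue
    FROM stg_sales s
    JOIN dim_date d ON d.date_key = s.date_key
    GROUP BY s.customer_key, d.month_key;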
During LLD reviews and updates, special consideration should be given to typical data modeling scenarios that occur in the project. For example:
- Verify that many-to-many attribute relationships are clarified and resolved
- Verify the types of keys that are used: surrogate keys, natural keys, ETL-generated keys
- Verify that business analysts and DBAs review, together with the ETL architects and application developers, the lineage and business rules for extracting, transforming, and loading the data warehouse
- Verify that all transformation rules, summarization rules, and matching and consolidation rules have clear specifications
- Confirm that the transformations, business rules, and cleansing described in the LLD and application-logic specifications meet business requirements, and that they have been coded correctly in the ETL, Java, or SQL used for data loads
- Verify that ETL procedures are documented to monitor and control data extraction, transformation, and loading. The procedures should describe how to handle exceptions and program failures
- Verify that consolidation of duplicate or merged records is handled properly (see the consolidation and SCD checks sketched after this list)
- Verify that samples of domain transformations will be checked to confirm that source values are mapped correctly to target values (see the domain and range checks sketched after this list)
- Ensure that primary key values are unique and that primary and foreign key values agree between the source data and the data loaded into the warehouse (see the key and referential-integrity checks sketched after this list)
- Validate that target data types are as specified in the design and/or the data model
- Verify that data field types and formats are specified and implemented
- Verify that default values are specified for fields where needed
- Verify that processing for invalid field values in the source is defined
- Verify that expected ranges of field values are specified
- Verify that all keys generated by the ETL “sequence generator” are identified
- Verify that slowly changing dimensions (SCDs) are identified and their handling described (see the sketch after this list)
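To illustrate the key and referential-integrity checks above, queries along the following lines can be run after a load. This is a minimal sketch: dim_product, stg_product, and fact_sales and their columns are hypothetical, and product_id is assumed to be unique in the dimension:

    -- Duplicate natural keys in the product dimension (expect zero rows)
    SELECT product_id, COUNT(*) AS row_count
    FROM dim_product
    GROUP BY product_id
    HAVING COUNT(*) > 1;

    -- Source products that never reached the warehouse (expect zero rows)
    SELECT s.product_id
    FROM stg_product s
    LEFT JOIN dim_product d ON d.product_id = s.product_id
    WHERE d.product_id IS NULL;

    -- Fact rows whose foreign key has no matching dimension row (expect zero rows)
    SELECT f.sales_id, f.product_key
    FROM fact_sales f
    LEFT JOIN dim_product d ON d.product_key = f.product_key
    WHERE d.product_key IS NULL;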
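Domain and range rules from the LLD translate into similar checks. The sketch below, again with hypothetical tables, columns, and code sets, flags target values outside the specified domain, samples the source-to-target mapping for review against the transformation specification, and looks for measures outside their expected range:

    -- Target category codes outside the domain defined in the LLD (expect zero rows)
    SELECT product_key, category_code
    FROM dim_product
    WHERE category_code NOT IN ('ELEC', 'FURN', 'CLTH', 'OTHR');

    -- Sample of the source-to-target domain mapping, for review against the spec
    SELECT s.category AS source_value, d.category_code AS target_value, COUNT(*) AS row_count
    FROM stg_product s
    JOIN dim_product d ON d.product_id = s.product_id
    GROUP BY s.category, d.category_code;

    -- Measures outside their expected range (expect zero rows)
    SELECT sales_id, quantity, unit_price
    FROM fact_sales
    WHERE quantity <= 0 OR unit_price < 0;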
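Consolidation and slowly changing dimension rules can be verified the same way. Assuming a hypothetical Type 2 dim_customer with customer_id as the natural key, a current_flag column, and effective_from/effective_to dates (current rows carrying a far-future effective_to), the following queries look for unconsolidated duplicates, multiple current rows, and overlapping history:

    -- Potential duplicates that were not consolidated: one email mapped to
    -- more than one current customer (expect zero rows)
    SELECT email, COUNT(DISTINCT customer_id) AS distinct_customers
    FROM dim_customer
    WHERE current_flag = 'Y'
    GROUP BY email
    HAVING COUNT(DISTINCT customer_id) > 1;

    -- Type 2 SCD: more than one current row per natural key (expect zero rows)
    SELECT customer_id, COUNT(*) AS current_rows
    FROM dim_customer
    WHERE current_flag = 'Y'
    GROUP BY customer_id
    HAVING COUNT(*) > 1;

    -- Type 2 SCD: overlapping effective-date ranges for the same natural key (expect zero rows)
    SELECT a.customer_id, a.customer_key AS row_a, b.customer_key AS row_b
    FROM dim_customer a
    JOIN dim_customer b
      ON  a.customer_id = b.customer_id
      AND a.customer_key < b.customer_key
      AND a.effective_from < b.effective_to
      AND b.effective_from < a.effective_to;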
Conclusion
Data warehouse testing is frequently deferred until late in the project life-cycle. If testing is shortchanged (e.g., due to schedule overruns or limited resource availability), there’s a high risk that critical data integrity issues will slip through the verification efforts. Even if thorough testing is performed, it’s difficult and costly to address most data integrity issues exposed by this late-cycle testing.
When testing occurs this late in the DW/BI life-cycle, the cause of an error can be anything from a data quality issue introduced when the data entered the warehouse to a processing issue caused by business-logic failures somewhere along the layers of data warehouse loading and BI reporting.