3 new ideas on improving the data warehouse lifecycle quality process

3 new ideas improving datawarehouse lifecycle quality process

Data warehousing for business intelligence and “big data initiatives” continues to gain significance as organizations become more aware of the benefits of decision oriented data warehouses. However, a key issue, with the rapid development and implementation of data warehouses, is that data quality defects are often injected during the multiple lifecycle stages of the warehouse.

A primary requirement is an efficient data warehouse (DW/BI) system process that reliably extracts, transforms, cleanses, and loads data from source systems on a 24 by 7 basis without impacting overall performance, scalability or reliability.

In this article, we present new ideas on a “beginning-to-end” data warehouse lifecycle quality process.

dw lifecycle quality process

1. Eliminating data quality errors

Injections of data quality problems occur during all phases of data warehousing:

data warehouse modeling and schema design,
ETL (extract, transformation, loading) design and coding,
integrating data from varied sources, and
running the ETL processes for data staging, cleaning, and loading.

It’s common, although undesirable, for problems and defects to emerge when populating the warehouse.

Data testing is often planned for the later phases of data warehouse projects. However, most Quality Assurance (QA) professionals agree that establishing a successful test planning and execution effort for the early project phases is one of the keys to success. As with other software development, the earlier data errors are detected, the less expensive it is to find and correct them. Besides that, planning early testing activities to be carried out during the design and before implementation gives project managers an effective means to regularly measure and document the project progress state.

2. QA efforts should begin early

Since the quality of a DW can be measured best with reference to a set of data requirements, a successful testing process begins with the gathering and documentation of end-user data requirements. As most end-users data requirements are about data analysis and data quality, it is inevitable that data warehouse testing will focus primarily on the ETL process on the one hand (this is sometimes called back-end testing), then on reporting and Online Analytical Processing (OLAP) on the other (front-end testing).

After gathering requirements, analysts develop conceptual, then detailed schemas to represent the user’s needs as an important reference for testing. DW and BI designers are responsible for logical schemata of data repositories and for data staging definitions that should be tested for efficiency and robustness.

The data architecture and model are necessary as a blueprint for any data warehouse. Understanding these artifacts and discovering potential defects help the QA team to comprehend the bigger picture of a data warehouse. The data model aids comprehension of the methods used for the key relationships between the major data sources. The relationship hierarchies and the depth of data throw light on the complexity of transformation rules.

In order to gain greater value from the QA team, it’s recommended to include them for quality assessments in all phases of data warehouse design and development and testing. Figure 1 shows typical DW project testing categories. Representative validations are shown associated with four significant QA contributions:

Database / schema testing
Data validation testing
Performance testing
Graphical User Interface (GUI) and business rule testing

Reviews, verifications, recommendations for improvement are among QA team contributions that aid in early removal of defects.

categories of dw testing — **Figure 1.** Categories of data warehouse testing to discover quality issues (graphic courtesy of Virtusa Corp.)

Here’s what no one tells you about developing a Master Test Plan for your DW/ BI project

3. QA team contributions during the DW/BI project lifecycle

Following are sample contributions and benefits from the QA team during the DW development lifecycle:

Planning for data integration and the ETL (data model, low-level DW design)

Testers gain an understanding of data to be reported by the application (e.g., profiling) and the tables upon which BI and other reports will be based
Testers review and understand the data model – gain an understanding of keys, ETL workflows and more from sources to targets
Testers review and become familiar with data Low Level Design (LLD) and mappings: add, update sequences for all sources for each target element

Planning for ETL verifications

Testers participate in ETL design reviews
The QA team gains an in-depth knowledge of ETL workflows / sessions, the order of job executions, restraints, transformations
Testers develop ETL test scenarios and distribute for reviews

Assessing ETL run and diagnostic logs: sessions, workflows, errors

After ETL’s are run, testers use checklists and test scenarios for QA assessments of rejects, session failures, errors
QA teams review ETL workflow outputs, source to target counts
Verify source to target mapping docs with loaded tables using a database management toolset (ex: TOAD, Navicat, MS SQL Server Management Studio, etc.)
After ETL runs or manual data loads, assess data in every table with a focus on key fields (dirty data, incorrect formats, duplicates, etc.). Use database management tools or simple spreadsheet tools. (SQL queries, filtering, etc.)

During DW design and development, testers plan and organize for the following DW quality categories:

Data completeness: Ensuring that all expected data is loaded
Data transformation: Ensuring that all data is transformed correctly according to business rules and/or design specifications
Data quality: Ensuring that the ETL system correctly rejects, substitutes default values, corrects (or ignores) invalid data
Performance and scalability: Ensuring that data loads and queries perform within expected time frames and that the technical architecture is scalable
Integration testing: Ensuring that the ETL process functions well with other upstream and downstream processes
User-acceptance testing: Ensuring the warehouse solution meets users’ current requirements and anticipates their future expectations
Regression testing: Ensuring that current functionality remains intact each time a new release of code is completed

Testing cannot guarantee that there will be no data errors. There are too many combinations and permutations, therefore, it is not practical to test each one. However, by joining forces with business analysts, database designers, developers and ranking the types of errors as suggested above, DW/BI projects will avoid wasting time on creating test scripts and test scenarios for less important possibilities and not having time to create test scripts and test scenarios for possibilities in which errors could significantly diminish or destroy the value of the data warehouse to the users.

Share0

Tweet0

About the author

Wayne Yaddow

Wayne Yaddow is an independent consultant with more than 20 years’ experience leading data integration, data warehouse, and ETL testing projects with J.P. Morgan Chase, Credit Suisse, Standard and Poor’s, AIG, Oppenheimer Funds, and IBM. He taught IIST (International Institute of Software Testing) courses on data warehouse and ETL testing and wrote DW/BI articles for Better Software, The Data Warehouse Institute (TDWI), Tricentis, and others. Wayne continues to lead numerous ETL testing and coaching projects on a consulting basis. You can contact him at wyaddow@gmail.com.

Cookie	Duration	Description
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
CookieLawInfoConsent	1 year	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Cookie	Duration	Description
__cf_bm	30 minutes	This cookie, set by Cloudflare, is used to support Cloudflare Bot Management.
sp_landing	1 day	The sp_landing is set by Spotify to implement audio content from Spotify on the website and also registers information on user interaction related to the audio content.
sp_t	1 year	The sp_t cookie is set by Spotify to implement audio content from Spotify on the website and also registers information on user interaction related to the audio content.
tve_leads_unique	1 month	This cookie is set by the provider Thrive Themes. This cookie is used to know which optin form the visitor has filled out when subscribing a newsletter.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_1Z635JPV9L	2 years	This cookie is installed by Google Analytics.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
vuid	2 years	Vimeo installs this cookie to collect tracking information by setting a unique ID to embed videos to the website.

Cookie	Duration	Description
_fbp	3 months	This cookie is set by Facebook to display advertisements when either on Facebook or on a digital platform powered by Facebook advertising, after visiting the website.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
AE_AB_COOKIE	1 year	No description
DEVICE_INFO	5 months 27 days	No description
loglevel	never	No description available.
tl_4829_4830_26	1 month	No description
tl_4829_4840_30	1 month	No description
tl_4829_4941_41	1 month	No description
tve_secret	1 year	No description available.

3 new ideas on improving the data warehouse lifecycle quality process

1. Eliminating data quality errors

2. QA efforts should begin early

Here’s what no one tells you about developing a Master Test Plan for your DW/ BI project

3. QA team contributions during the DW/BI project lifecycle

Wayne Yaddow

How AI Is Reinventing MDM and Data Governance

From fragmented data to planetary-scale systems: why FSA/MEBS represents a step-change in enterprise modeling

Optimizing retail operations through a practical data strategy

Transforming Marketing Data into Business Growth: Key Insights and Strategies

The future of generative AI’s form factor

You may also like:

Managing DW/ BI data integration risks through data reconciliation and data lineage processes

What you need to know about regression testing on DW/ BI projects

The what, the why, and the how of shift left testing for BI and Data Warehouse Projects