How to identify and reduce DW/BI data quality risks

An introduction to DW/BI data quality risk assessments

Data warehouse and business intelligence (DW/BI) projects are showered with risks, from data quality in the warehouse to analytic values in BI reports. If not addressed properly, data quality risks can bring entire projects to a halt, leaving planners scrambling for cover, sponsors searching for remedies, and budgets wiped out.

Data consumers, such as end users, often don’t know exactly what they want delivered until they start seeing early versions of a BI application (e.g., reports). This circumstance often requires DW/BI teams to build the data warehouse and application reports before they are fully defined and specified. Couple this challenge with the data quality problems inherent in sourcing operational systems, and the potential for risk is very real.

This article outlines methods for recognizing and minimizing the data quality risks often associated with DW/BI projects. Addressing additional DW/BI project risks (performance, schedules, defects discovered late in the software development life cycle, etc.) is also important, but is beyond the scope of this piece.

Data quality is the desired state in which an organization’s data assets reflect the following attributes:

  • Clear definitions
  • Correct values in sources, during extraction, while loading into targets, and in analytic reports
  • An understandable presentation format in analytic applications
  • Usefulness in supporting targeted business processes

Figure 1 illustrates primary control points (i.e., testing points) in an end-to-end data quality auditing and reporting process.

Figure 1: A few of the many checkpoints necessary to adequately audit data quality for DW/BI projects.

Data quality risks should be addressed early and often

The extraction, transformation, and loading (ETL) process is still the most underestimated, under-budgeted part of most DW/BI iterations. The biggest reason the ETL portion of a project often raises more questions than it resolves is a lack of understanding of source data quality, and of the data quality that results from the ETL processes themselves.

Before diving into the most common data quality risks for data warehouse and business intelligence projects and how to manage them, let’s ensure we are on the same page with two definitions, as we will address them both:

Risk assessment is the conversion of risk data into risk decision-making information. Risks are composed of two factors: (1) risk probability and (2) risk impact.

Data quality risk management is a structured approach for the identification, assessment, and prioritization of data quality risks followed by planning of resources to minimize, monitor, and control the probability and impact of undesirable events.
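As a concrete illustration of turning those two factors into decision-making information, the sketch below scores and ranks risks. The 1-3 numeric scale and the multiplicative combination are illustrative assumptions, not prescriptions from this article:

```python
# Hypothetical 1-3 scale for qualitative ratings; combining the two
# factors by multiplication is also an assumption, not a standard.
RATING = {"Low": 1, "Medium": 2, "High": 3}

def risk_score(probability, impact):
    """Combine the two risk factors into a single priority score."""
    return RATING[probability] * RATING[impact]

# Two of the risks discussed in this article, with their typical ratings.
risks = [
    ("Source data quality in doubt", "High", "High"),
    ("Human resource skills", "Medium", "High"),
]

# Rank so the highest-scoring risks receive mitigation resources first.
ranked = sorted(risks, key=lambda r: risk_score(r[1], r[2]), reverse=True)
```

Even a crude ranking like this helps when planning where to spend limited testing and mitigation effort.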

The most common DW/BI project data quality risks are listed below. Each entry gives the risk, the odds that it typically exists, its likely impact, and potential risk mitigation tasks. The listing is not intended to be exhaustive.

Risk 1: HUMAN RESOURCE SKILLS (Odds: Medium; Impact: High)
Insufficiently qualified resources with the required knowledge of data warehouse and business intelligence testing; lack of skills with data testing toolsets, methods, and best practices.

Mitigation:
  • Engage DW/BI training resources
  • Recruit staff with DW experience
  • Contract DW/BI professional consultants

Risk 2: MASTER TEST PLAN/STRATEGY (Odds: Medium; Impact: High)
A master test plan/strategy does not exist or is inadequate in scope.

Mitigation:
  • Create a test plan documenting the overall structure and objectives of all project testing, from unit testing to component, system, and performance testing. The plan should cover activities across the DW/BI lifecycle and identify evaluation criteria for the testers.

Risk 3: SOURCE DATA QUALITY IN DOUBT (Odds: High; Impact: High)
The data integration effort may not meet the planned schedule because the quality of the source data is unknown.

Mitigation:
  • Formally profile all source data early (i.e., during requirements gathering) to understand whether its quality meets project needs
  • Identify and resolve inaccuracies, omissions, cleanliness issues, and inconsistencies in the source data before or during the extract/transform process
  • Where specific data elements exist on multiple source systems, identify the various sources and discuss with users which are the most applicable
  • Use commercial data quality tools, accompanied by consultation and training

Risk 4: SOURCE DATA HISTORY INCOMPLETE (Odds: Medium; Impact: High)
Levels of historical data differ among the source systems.

Mitigation:
  • Suppose the business requirement calls for four years of historical data, but the best, most recent data covers only one year for some sources and three years for others. The missing years would need to be extracted from other data sources, possibly of questionable quality; follow the risk mitigation tasks listed above to address that data’s quality.
  • Flag this risk early to the project sponsor(s) and ask whether a change in business requirements is desirable

Risk 5: SOURCE AND TARGET DATA MAPS SUSPECT (Odds: Medium; Impact: High)
Source data may be inaccurately mapped due to the absence of data dictionaries, metadata, or data models.

Mitigation:
  • Develop and maintain data dictionaries covering all data associated with the project; quality data mapping documents may be the result.

Risk 6: ETL TARGET DATA IN ERROR (Odds: Medium; Impact: High)
Only a subset of the loaded data can be tested.

Mitigation:
  • Ensure that the target data sampling process is high quality
  • Use test tools that allow for extensive data coverage
  • Choose a data sampling approach extensive enough to avoid missing defects in both source and target data
  • Choose an appropriate technology to match source and target data and determine whether they are equal or the target data has been transformed
  • Verify that no data or information is lost during ETL processes; the data warehouse must receive all relevant data from the source application according to business rules
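One way to verify that nothing was lost in flight, for columns the ETL copies unchanged, is to reconcile row counts and order-independent checksums between source and target. The tables below are hypothetical:

```python
import hashlib

def table_fingerprint(rows, columns):
    """Return (row count, order-independent XOR of per-row hashes)."""
    count, acc = 0, 0
    for row in rows:
        key = "|".join(str(row[c]) for c in columns).encode()
        acc ^= int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        count += 1
    return count, acc

# Hypothetical source extract and warehouse target; load order differs.
source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
target = [{"id": 2, "amount": 20}, {"id": 1, "amount": 10}]
match = table_fingerprint(source, ["id", "amount"]) == \
        table_fingerprint(target, ["id", "amount"])
```

Because the per-row hashes are XORed, the check is insensitive to load order; columns the ETL transforms still need rule-aware comparisons rather than this byte-for-byte match.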

Risk 7: SOURCE-TO-TARGET END-TO-END TESTING UNCOORDINATED (Odds: High; Impact: High)
Poor or non-existent testing of source-to-warehouse data flows.

Mitigation:
  • End-to-end auditability must include validating that the information in a source system (such as a spreadsheet) is accurate, so there is a high level of confidence that it can be trusted when loaded into the warehouse. Organizations that quality-check data at only a subset of points in the warehouse will probably fail to adequately protect themselves from the data quality problems that emerge as information is exchanged between those points.

Risk 8: DATA DICTIONARIES AND DATA MODELS ARE LACKING (Odds: Medium; Impact: High)
Data and information within the warehouse and/or marts cannot easily be interpreted by developers and the quality assurance (QA) team.

Mitigation:
  • Ensure accurate and current documentation of data models and mapping documents
  • Use automated documentation tools
  • Create meaningful documentation of data definitions and descriptions in a data dictionary
  • Create procedures for keeping documentation in line with changes to the source systems
  • Have data stewards/owners train the QA team
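Keeping the data dictionary in a machine-readable form lets the QA team check warehouse rows against it directly; every field name and rule below is illustrative, not from any real project:

```python
# Illustrative machine-readable dictionary entries; all names are invented.
data_dictionary = {
    "customer_id": {"description": "Surrogate key for the customer dimension",
                    "type": "int", "nullable": False},
    "email": {"description": "Primary contact e-mail",
              "type": "str", "nullable": True},
}

def check_row(row, dictionary):
    """Validate one warehouse row against the dictionary's type/null rules."""
    errors = []
    for col, spec in dictionary.items():
        val = row.get(col)
        if val is None:
            if not spec["nullable"]:
                errors.append(f"{col}: null not allowed")
        elif type(val).__name__ != spec["type"]:
            errors.append(f"{col}: expected {spec['type']}")
    return errors
```

The same structure doubles as training material for the QA team and as documentation that stays testable as source systems change.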

Risk 9: EXCESSIVE DATA DEFECTS (Odds: High; Impact: High)
Data defects are found late in each iteration.

Mitigation:
  • Ensure that data requirements are complete and that data dictionaries are available and current
  • Profile all data sources and targets after each ETL run
  • Keep data mappings and all other specification documents current

Risk 10: COMPLEX DATA TRANSFORMATIONS (Odds: High; Impact: High)
Complex data transformations and BI reports.

Mitigation:
  • Early validation of table-join complexity, queries, and the resulting business reports
  • Validation and clarification of business requirements, as well as early and careful translation of the data requirements
  • Validation of the number and accessibility of source data fields

Risk 11: DATA VOLUME SCALABILITY IN DOUBT (Odds: Medium; Impact: High)
Data volumes grow due to changing requirements.

Mitigation:
  • Employ toolsets for data volume estimation
  • Have experienced database administrators/data architects produce the technical design for data volumes
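A back-of-envelope volume estimate helps ground those design discussions early; the overhead factor and the example figures here are assumptions, not benchmarks:

```python
def estimate_table_size_gb(rows_per_day, avg_row_bytes, retention_days,
                           index_overhead=0.3):
    """Back-of-envelope sizing; the 30% index/overhead factor is assumed."""
    raw_bytes = rows_per_day * avg_row_bytes * retention_days
    return raw_bytes * (1 + index_overhead) / 1024**3

# e.g., 2M rows/day at roughly 200 bytes/row, retained for four years
size_gb = estimate_table_size_gb(2_000_000, 200, 4 * 365)
```

Estimates like this are only a starting point; an experienced DBA or data architect should refine them against the actual DBMS storage characteristics.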

Risk 12: DATA REQUIREMENTS INCOMPLETE (Odds: Medium; Impact: High)
Quality issues arise from unclear or non-existent data requirements documentation.

Mitigation:
  • Ensure that requirements are updated whenever change requests are approved
  • Make the data requirements documentation part of the project’s deliverables
  • Validate and clarify requirements, and translate the data requirements early and carefully against both business and technical requirements

Risk 13: REGRESSION TESTS NOT AUTOMATED (Odds: Medium; Impact: High)
Minimal automation of regression tests.

Mitigation:
  • Without automated regression tests, manual testing may result in fewer tests being run after builds are deployed. Make sure the test plan accounts for this, and ask for more time and/or staff to complete the necessary testing.
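Even a thin layer of automated checks, re-run after every load, catches regressions that manual spot checks miss. A minimal sketch, with hypothetical table fields and rule names:

```python
def run_regression_checks(fact_rows):
    """Re-runnable data checks for a fact table; returns failing rule names."""
    failures = []
    if not fact_rows:
        failures.append("non_empty")
    if any(r["amount"] < 0 for r in fact_rows):
        failures.append("no_negative_amounts")
    ids = [r["order_id"] for r in fact_rows]
    if len(ids) != len(set(ids)):
        failures.append("unique_order_id")
    return failures

# Intended to run after each build/deployment, e.g. from a scheduler or CI job.
sample = [{"order_id": 1, "amount": 9.5}, {"order_id": 2, "amount": 12.0}]
```

Wiring such checks into the deployment pipeline means every build gets at least a baseline regression pass, regardless of available manual testing time.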

More potential data quality risks to consider:

  • Legacy data architecture and data definition artifacts may be unavailable or incomplete, hampering project planning. Source data may be inaccurately mapped due to a missing or outdated legacy system data dictionary
  • The project team may encounter incompatible software, hardware, and/or processes due to multiple operating systems or vendors, or format incompatibilities (e.g., database management system (DBMS) to DBMS, or DBMS to operating system)
  • The integrity and quality of the converted data may be compromised due to a lack of enterprise-wide data governance
  • The quality of the target system data may not meet departmental standards because independent data validation (e.g., by a quality assurance department or an off-shore team) was not considered part of the scope of work
  • Source data may be inaccurately transformed and migrated due to a lack of involvement of key business subject matter experts in the requirements and business rule process

The following DW/BI project tasks should also be given robust assessment and attention, along with the verification and validation responsibilities associated with them, to enhance data quality management throughout your data warehouse project:

  1. Business requirements collection and analysis
  2. Logical data modeling
  3. Data mapping
  4. Physical modeling and implementation
  5. Extraction, transformation, and loading (ETL) design
  6. Report/ data visualization and cube design
  7. Project planning
  8. Data quality management
  9. Testing and validation
  10. Data warehouse archiving
  11. Backup and recovery of the data warehouse
  12. Change management
  13. Return on Investment (ROI) determination

Concluding remarks

Data warehouse and business intelligence projects can be highly complex and inherently risky. It is the responsibility of the project manager to lead the DW/BI team in identifying all data quality risks associated with a particular data warehouse implementation. The goal of this process is to document essential information relating to project risk.

If the project team and its designers fail to assess the quality of source data, they expose the entire project to excessive risk. Consider this: if the DW/BI team does not take the time to assess all source data quality, it is entirely possible to purchase and install all the DW/BI technology, do all the analysis, write all the source-to-target code to populate target tables, and still fail.
