How to identify and reduce DW/BI data quality risks
An introduction to DW/BI data quality risk assessments
Data warehouse and business intelligence (DW/BI) projects are showered with risks, from data quality in the warehouse to analytic values in BI reports. If not addressed properly, data quality risks can bring entire projects to a halt, leaving planners scrambling for cover, sponsors searching for remedies, and budgets wiped out.
Data consumers, such as end users, often don’t know exactly what they want delivered until they start seeing early versions of a BI application (e.g., reports). This circumstance forces DW/BI teams to build the data warehouse and application reports before they are fully defined and specified. Couple this challenge with the data quality problems inherent in sourcing operational systems, and the potential for risk is very real.
This article outlines methods for recognizing and minimizing the data quality risks often associated with DW/BI projects. Addressing other DW/BI project risks (performance, schedules, defects discovered late in the software development life cycle, etc.) is also important but beyond the scope of this piece.
Data quality is the desired state in which an organization’s data assets reflect the following attributes:
- Clearly defined
- Correct in value in sources, during extraction, while loading to targets, and in analytic reports
- Understandably presented in analytic applications
- Useful in supporting targeted business processes
Figure 1 illustrates primary control points (i.e., testing points ✓) in an end-to-end data quality auditing and reporting process.
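In code, a control point of the kind Figure 1 depicts reduces to a reconciliation check run at each hand-off between stages. The sketch below compares row counts between stages; the stage names, counts, and count-only comparison are illustrative assumptions (a real audit would also reconcile checksums and sums of key measures):

```python
# Minimal sketch of control-point auditing between pipeline stages.
# Stage names, row counts, and the count-only check are simplifying
# assumptions for illustration.

def audit_control_point(stage_from: str, stage_to: str,
                        rows_out: int, rows_in: int,
                        expected_rejects: int = 0) -> dict:
    """Compare row counts across a hand-off, allowing for known rejects."""
    discrepancy = rows_out - rows_in - expected_rejects
    return {
        "control_point": f"{stage_from} -> {stage_to}",
        "rows_out": rows_out,
        "rows_in": rows_in,
        "expected_rejects": expected_rejects,
        "discrepancy": discrepancy,
        "passed": discrepancy == 0,
    }

# Audit two hand-offs: source extract -> staging, staging -> warehouse.
audits = [
    audit_control_point("source", "staging", rows_out=10_000, rows_in=10_000),
    audit_control_point("staging", "warehouse", rows_out=10_000,
                        rows_in=9_990, expected_rejects=10),
]
```

A failed audit at any control point localizes the problem to one hand-off, which is exactly what end-to-end auditability is meant to buy.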
Data quality risks should be addressed early and often
The extraction, transformation, and loading (ETL) process is still the most underestimated, under-budgeted part of most DW/BI iterations. The biggest reason the ETL portion of a project often raises more questions than it resolves is a poor understanding of source data quality and, consequently, of the quality of the data the ETL processes produce.
Before diving into the most common data quality risks for data warehouse and business intelligence projects and how to manage them, let’s agree on two definitions, as we will address both:
Risk assessment is the conversion of risk data into risk decision-making information. Every risk is composed of two factors: (1) the probability that it will occur and (2) the impact if it does.
Data quality risk management is a structured approach for the identification, assessment, and prioritization of data quality risks followed by planning of resources to minimize, monitor, and control the probability and impact of undesirable events.
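To make the two-factor definition concrete, a risk’s exposure can be scored as probability × impact and used to rank risks for prioritization. The 1–5 rating scales and the example risks below are illustrative assumptions, not prescriptions from any standard:

```python
# Illustrative risk-exposure scoring: exposure = probability x impact.
# The 1-5 scales and the example ratings are assumptions for demonstration.

def risk_exposure(probability: int, impact: int) -> int:
    """Score a risk on two 1-5 scales; higher exposure means higher priority."""
    if not (1 <= probability <= 5 and 1 <= impact <= 5):
        raise ValueError("probability and impact must be rated 1-5")
    return probability * impact

def prioritize(risks: dict) -> list:
    """Return (risk name, exposure) pairs sorted by descending exposure."""
    scored = {name: risk_exposure(p, i) for name, (p, i) in risks.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

example = {
    "Source data quality in doubt": (4, 5),
    "Regression tests not automated": (3, 4),
    "Data volume scalability in doubt": (2, 3),
}
```

Ranking by exposure gives the team a defensible order in which to spend mitigation effort, which is the "prioritization" half of the risk management definition above.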
The most common DW/BI project data quality risks are listed below, each accompanied by recommendations for mitigating it. The listing is not intended to be exhaustive.
Data quality risks and potential risk mitigation tasks
HUMAN RESOURCE SKILLS
• Engage DW/BI training resources
• Recruit staff with DW experience
• Contract DW/BI professional consultants
MASTER TEST PLAN/STRATEGY
• Create a test plan documenting the overall structure and objectives of all project testing, from unit and component testing through system and performance testing. The plan should cover activities across the DW/BI lifecycle and identify evaluation criteria for the testers.
SOURCE DATA QUALITY IN DOUBT
• Formally profile all source data early (i.e., during requirements gathering) to understand whether its quality meets project needs
SOURCE DATA HISTORY INCOMPLETE
• What if your business requirement calls for four years of historical data, but the best, most recent data covers only one year for some sources and three years for others? The missing years would need to be extracted from other data sources, possibly of questionable quality. Profile those sources and follow the risk mitigation tasks listed above.
SOURCE & TARGET DATA MAPS SUSPECT
• Develop and maintain data dictionaries for all data associated with the project; quality source-to-target mapping documents follow from them.
ETL TARGET DATA IN ERROR
• Sample and verify loaded target data against its sources, and ensure that the sampling process itself is sound
SOURCE & TARGET END TO END TESTING UNCOORDINATED
• End-to-end auditability must include validation that the information in a source system (such as a spreadsheet) is accurate, so there is high confidence it can be trusted when loaded to the warehouse. Organizations that perform quality checks at only a subset of points in the warehouse will probably fail to protect themselves adequately from the data quality problems that emerge as information moves between these points.
DATA DICTIONARIES AND DATA MODELS ARE LACKING
• Ensure accurate and current documentation of data models and mapping documents
EXCESSIVE DATA DEFECTS
• Ensure that data requirements are complete and that data dictionaries are available and current
COMPLEX DATA TRANSFORMATIONS
• Early validation of table join complexity, queries and resulting business reports
DATA VOLUME SCALABILITY IN DOUBT
• Employ toolsets for data volume estimations
DATA REQUIREMENTS INCOMPLETE
• Ensure that requirements are always updated after change requests are approved
REGRESSION TESTS NOT AUTOMATED
• Without automated regression tests, manual testing may mean fewer tests are run after each build is deployed. Make sure your test plan accounts for this, and ask for more time and/or staff to complete the necessary testing
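Several of the mitigations above begin with formal profiling of source data. A minimal profiling pass needs nothing beyond the standard library; the field names and sample rows below are invented for illustration, and a real profile would also cover value ranges, patterns, and referential checks:

```python
# Minimal column-profiling sketch for early source-data assessment.
# Field names and sample rows are hypothetical; real profiling would
# also examine ranges, formats, and cross-table referential integrity.
from collections import Counter

def profile_column(rows: list, column: str) -> dict:
    """Summarize one column: row/null/distinct counts and top values."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "column": column,
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

sample = [
    {"customer_id": "C1", "region": "EU"},
    {"customer_id": "C2", "region": ""},
    {"customer_id": "C2", "region": "US"},
]
```

Even this crude pass surfaces the questions that matter during requirements gathering: how much of a column is missing, and whether a supposed key is actually unique.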
More potential data quality risks to consider:
- Legacy data architecture and data definition artifacts may be unavailable or incomplete, hampering project planning. Source data may be mapped inaccurately due to a missing or outdated legacy-system data dictionary
- The project team may encounter incompatible software, hardware, and/or processes due to multiple operating systems or vendors, or format incompatibilities (for example, Database Management System (DBMS) to DBMS, or DBMS to operating system)
- The integrity and quality of the converted data may be compromised due to lack of enterprise-wide data governance
- The quality of the target-system data may not meet departmental standards because independent data validation (e.g., by a Quality Assurance department or an off-shore team) was not considered part of the scope of work
- Source data may be inaccurately transformed and migrated due to lack of involvement of key business subject matter experts in the requirements and business rule process
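The last two risks, missing independent validation and missing subject matter expert input, can be partly mitigated by encoding SME-supplied business rules as executable checks against the transformed target data. The rule names, rules, and records below are illustrative assumptions:

```python
# Sketch of independent validation of transformed target data against
# SME-supplied business rules. Rules and records are hypothetical.

def validate_record(record: dict, rules: list) -> list:
    """Return the names of every rule this record violates."""
    return [name for name, check in rules if not check(record)]

# Each rule is a (name, predicate) pair agreed with business SMEs.
rules = [
    ("order_total_non_negative", lambda r: r["order_total"] >= 0),
    ("status_in_domain", lambda r: r["status"] in {"OPEN", "SHIPPED", "CLOSED"}),
]

records = [
    {"order_total": 120.0, "status": "OPEN"},
    {"order_total": -5.0, "status": "PENDING"},
]
violations = {i: validate_record(r, rules) for i, r in enumerate(records)}
```

Because the rules are data rather than code buried in ETL jobs, an independent QA group can maintain and run them without touching the transformation logic itself.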
The following DW/BI project tasks should also be assessed thoroughly, with verification or validation responsibilities assigned to each, to enhance data quality management throughout your data warehouse project:
- Business requirements collection and analysis
- Logical data modeling
- Data mapping
- Physical modeling and implementation
- Extraction, transformation, and loading (ETL) design
- Report/data visualization and cube design
- Project planning
- Data quality management
- Testing and validation
- Data warehouse archiving
- Backup and recovery of the data warehouse
- Change management
- Return on Investment (ROI) determination
Data warehouse and business intelligence projects can be highly complex and inherently risky. It is the responsibility of the project manager to lead the DW/BI team in identifying all data quality risks associated with a particular data warehouse implementation. The goal of this process is to document essential information relating to project risk.
If the project team and its designers fail to assess the quality of source data, they expose the entire project to excessive risk. Consider this: if the DW/BI team does not take the time to assess all source data quality, it is entirely possible to purchase and install all the DW/BI technology, do all the analysis, write all the source-to-target code to populate target tables, and still fail.