Digital Age and Data Explosion
The current industrial age, known as the digital age, is characterized by an overabundance of data. Advances in technology, the decreasing cost of disk hardware, and the availability of cloud storage have facilitated the collection or generation, processing, and storage of large volumes of data at much lower cost. With the increasing number of electronic data-generating devices, including smart devices (for example, smartphones, smart meters, smart cars, and smart thermostats), the Internet of Things, and cloud computing, organizations have been able to capture enormous volumes of data in a relatively short time, resulting in an explosion of data (Mahanti 2022).
Organizations have a large number of data entities and data elements, and large amounts of data stored in repositories and flowing in and through their data pipelines. However, treating all data elements equally in terms of governance and quality management is not a feasible approach to managing data. Hence it is important to prioritize data elements, isolate the key ones, and manage the quality and governance of these key data elements. This is where the concept of critical data elements comes into the picture.
In short, critical data elements are data elements that have a direct or indirect financial impact if their data quality is not up to the mark along one or more data quality dimensions (Mahanti 2019).
In this article we discuss some key concepts around data, why critical data elements are important, and some pointers on how to define critical data elements.
Definition of key data terms used in this article
Before we continue, let's go through a few data-related terms used in this article (Mahanti 2021).
- Data entities are the real-world objects, concepts, events, and phenomena about which we collect data. For example, customer is one of the most common entities.
- Data elements are the different attributes that describe the data entity. For example, data elements of the customer entity might be a unique id to identify the customer, customer name, date of birth and status.
- Data quality is the capability of data to satisfy the stated business, system, and technical requirements of an enterprise. Data quality is an evaluation of data’s fitness to serve their purpose in a given context. Data are considered of high quality if they are fit for their intended use (Mahanti 2019).
- Data quality dimensions are characteristics that define the quality of data. Data quality dimensions provide a means to quantify and manage the quality of data (Mahanti 2019). Referring to the customer entity in our example, these would include the presence of useful values for each of the data elements in each record of the customer data entity, the timely availability of the data, and the accuracy and currency of the data.
- Data governance is the exercise and enforcement of policies, processes, guidelines, rules, standards, metrics, controls, decision rights, roles, responsibilities, and accountabilities to manage data as a strategic enterprise asset (Mahanti 2021a).
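To make these terms concrete, here is a minimal Python sketch that models the customer entity from the examples above, with each field standing for a data element, and measures one possible data quality dimension: the share of records in which each data element has a value. The `Customer` class, the `completeness` helper, the field names, and the sample records are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class Customer:
    """Hypothetical customer entity; each field is one data element."""
    customer_id: str
    name: Optional[str]
    date_of_birth: Optional[date]
    status: Optional[str]


def completeness(records: list[Customer]) -> dict[str, float]:
    """Return, per data element, the share of records with a non-empty value."""
    if not records:
        return {}
    result = {}
    for f in fields(Customer):
        present = sum(
            1 for r in records if getattr(r, f.name) not in (None, "")
        )
        result[f.name] = present / len(records)
    return result


# Illustrative records only, not real data.
customers = [
    Customer("C001", "Ann Lee", date(1985, 4, 12), "active"),
    Customer("C002", "Raj Patel", None, "active"),
    Customer("C003", None, date(1990, 9, 30), ""),
    Customer("C004", "Mia Chen", date(1978, 1, 5), "inactive"),
]

scores = completeness(customers)
for element, score in scores.items():
    print(f"{element}: {score:.0%}")
```

In this made-up sample, the unique ID is always present while the other three data elements are each missing in one of four records. Other dimensions, such as accuracy or currency, would need their own measures and reference data.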
Concept of critical data elements (CDEs) and why they are important
A critical data element can be defined as a data element that supports enterprise obligations or critical business functions or processes, and will cause customer dissatisfaction, pose a compliance risk, or have a direct financial impact if the data quality is not up to the mark along one or more data-quality dimensions (Mahanti 2019).
Given the vast number of data elements and the large volumes of data that an organization stores, ensuring the quality of the organization’s entire data and governing all data with the same rigor is an expensive and resource-intensive exercise, and one that is not recommended. After all, not everything that can be counted, counts! That is where the concept of critical data elements comes into the picture.
Not all data are created equal, and hence not all data have the same level of importance. Some data elements are critical and have a direct or indirect financial impact if they are not fit for use or are exposed to unauthorized access. Sensitive data, such as a social security number, is also a critical data element. Sometimes a data element alone might not be deemed sensitive but becomes sensitive when it is part of a group of data elements. Personally identifiable information (PII), personal health information (PHI), and Payment Card Industry (PCI) data are examples of this scenario. Organizations must ensure that critical data elements are of high quality, govern them rigorously, and ensure that they are fit for their intended use (Mahanti 2019; Mahanti 2021a; Mahanti 2021b).
Some data elements are moderately critical and require less rigorous governance and quality assessment processes. On the other hand, some data elements might not be of any value and assessing their quality or having rigorous or even moderate governance for them is a waste of time, money, and effort. Non-sensitive data need to be governed lightly.
For example, many data values are captured and stored for dubious reasons, such as being part of a purchased data model, or retained from a data migration project, but they may not be necessary to achieve any business objectives. Managing the quality of such data is a waste of time and effort (Mahanti 2019) and such data need light governance (Mahanti 2021a).
How to determine critical data elements?
Trying to govern, measure, and manage the quality of all the data elements and data that an organization has can be an overwhelming and financially infeasible exercise that is bound to fail. Determining and prioritizing critical data elements is one of the first steps to successfully managing data to meet business objectives.
Establishing CDEs at an enterprise level would require a factor rating method to rate data elements and CDEs. This would involve determining the factors and the rating and/or weights associated with each factor and defining a formula to calculate the score and sort the data elements in the descending order of rank.
The two most common factors used to determine critical data elements, and the score derived from them, are:
1. The number of enterprise obligations or business use cases for which the data elements are critical. Data elements that are critical for one enterprise obligation or use case may not be critical for another enterprise obligation or use case.
2. The impact if the data element is not fit for use, is exposed to unauthorized access, or is stolen. Impacts can be financial, regulatory, legal, reputational, customer dissatisfaction, or a combination of these. Sensitive data elements are high impact, as there is a huge financial cost associated with them if they are stolen or manipulated. Sometimes a data element alone might not be deemed sensitive but becomes sensitive when it is part of a group of data elements; such elements should be considered high impact. The Likert scale can be used to assign a rating based on the level of impact, as shown in Table 1: Impacts Rating.
3. The criticality score (C) is calculated as follows:
Criticality Score (C) = Number of enterprise obligations or business use cases for which the data elements are critical (N) * Impact Rating (I)
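The factor rating method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the `DataElement` class and `rank_by_criticality` function are hypothetical names, the impact rating is assumed to sit on a 1 to 5 Likert scale, and the values of N and I are invented for the example rather than taken from the tables in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """A data element with the two rating factors described above."""
    name: str
    use_cases: int      # N: obligations or use cases for which it is critical
    impact_rating: int  # I: assumed 1 (lowest) to 5 (highest) Likert rating

    @property
    def criticality(self) -> int:
        # Criticality Score (C) = N * I
        return self.use_cases * self.impact_rating


def rank_by_criticality(elements: list[DataElement]) -> list[DataElement]:
    """Sort data elements in descending order of criticality score."""
    return sorted(elements, key=lambda e: e.criticality, reverse=True)


# Invented values for illustration only.
elements = [
    DataElement("Social Security Number", use_cases=4, impact_rating=5),
    DataElement("Customer mobile number", use_cases=2, impact_rating=2),
    DataElement("Customer status", use_cases=3, impact_rating=2),
]

ranked = rank_by_criticality(elements)
for rank, element in enumerate(ranked, start=1):
    print(f"{rank}. {element.name}: C = {element.criticality}")
```

With these invented ratings, the social security number ranks first and the customer mobile number last. An organization could extend the same structure with additional factors or weights, provided the formula is applied consistently across all data elements.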
Table 2 shows a sample data element factor rating matrix.
Table 2: Sample data element factor rating matrix
| Data Elements (DE) | Number of enterprise obligations or business use cases for which the DE is critical (N) | Impact Rating (I) | Criticality Score C = (N * I) |
| Social Security Number | | | |
| Customer mobile number | | | |
| Data element n | | | |
The data elements with the highest scores (for example, country code and social security number have the highest scores in Table 2) should be taken into consideration first from a management, governance, and quality improvement perspective if needed, followed by those with lower scores (for example, customer status and customer mobile number). Given the costs involved, usually 200 to 250 data elements in an organization are targeted for CDE prioritization.
Determining and prioritizing critical data elements is one of the first steps to successfully managing data to meet business objectives.
This article discusses why CDEs are important and outlines a simple methodology to determine, rank, and prioritize CDEs. There might be data elements that have higher criticality scores, but the cost to strategically fix these CDEs might be high, or fixing them might not be feasible. After all, not everything that counts can be counted! Hence critical data elements with lower scores might be targeted for data quality improvement instead.
As with my previous article on CDEs, I would like to wind up this article with a famous quote, often attributed to the renowned physicist Albert Einstein, which applies to data elements and CDEs.
“Not everything that can be counted counts, and not everything that counts can be counted.”
This article draws significantly from the research presented in the books Data Governance and Compliance: Evolving to Our Current High Stakes Environment and Data Governance and Data Management: Contextualizing Data Governance Drivers, Technologies, and Tools, both published by Springer in 2021, and Data Quality: Dimensions, Measurement, Strategy, Management and Governance, published by ASQ Quality Press in 2019.