When reference data comes up in discussions with other data management or data governance practitioners, we sometimes realize we mean different things or we only overlap in our definitions 80% of the time. Within a data governance conversation, this seems a bit ironic. So what is reference data, in the context of data management (let’s not worry about programming languages at this point)? I’ve been looking for that one standardized definition and of course, I could not find one – again, in the context of data governance, I see this as another irony.
One thing that’s agreed upon is that managing reference data is important. It is important, because:
- it’s estimated that anywhere between 20% and 50% of the tables in a database house reference data
- the data quality issues of reference data will have a cascading effect in data analysis, reporting and data integration
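To make the cascading effect concrete, here is a minimal sketch in Python. The table name `ORDER_STATUS`, the status codes and the order records are all made up for illustration; the only point is that one missing reference value quietly distorts a downstream report.

```python
from collections import Counter

# Hypothetical reference table: order status code -> label.
# "SHP" is deliberately missing to simulate a reference data quality issue.
ORDER_STATUS = {"NEW": "New", "CAN": "Cancelled"}

# Hypothetical transactional records that point at the reference codes.
orders = [
    {"id": 1, "status": "NEW"},
    {"id": 2, "status": "SHP"},
    {"id": 3, "status": "SHP"},
    {"id": 4, "status": "CAN"},
]


def report_by_status(orders, status_ref):
    """Count orders per status label; codes missing from the reference table land in 'UNKNOWN'."""
    return Counter(status_ref.get(order["status"], "UNKNOWN") for order in orders)


report = report_by_status(orders, ORDER_STATUS)
for label, count in sorted(report.items()):
    print(label, count)
```

In this sketch, half of the orders end up under an "UNKNOWN" label, and every report or integration built on top of that table would inherit the same gap.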
Read more about the 5 best practices for managing reference data.
However, we cannot properly manage what we don’t understand, so we need a clear consensus on what reference data is. First, let’s look at the:
Reference data characteristics:
- It is created and changed less often than master data – Once you’ve loaded your table with currency types, you rarely have to update it. For example, the “new” Euro currency came into effect on January 1st, 1999; the currencies it replaced can still be redeemed after losing legal tender status, indefinitely in some cases and until an official cut-off date in others.
- Shared by multiple systems within or outside the enterprise – For example, the list of countries, sex and gender codes, types of diseases, units of measurement, etc.
- It does not describe things that the enterprise does business with, but rather it categorizes the data which describes the enterprise’s transactions and master data – Such as the type of products, status of the orders, location of the customers, etc.
- Each piece of data has a distinct definition – Ex: the type of an organization could be a corporation, foundation, government corporation, non-profit organization, and so on, each with its specific definition
- Often defined by 3rd party bodies – Ex: ISO, UN, WHO, etc.
Besides being defined by 3rd party bodies, which are either business-domain specific or worldwide (such as the World Health Organization), reference data can also be organization specific. Therefore, reference data can be split into 3 categories:
- Universal reference data
- Industry reference data
- Internal reference data
Here are some reference data examples across these categories:
So what is reference data?
A set of permissible values associated with a distinct definition, used within a system or shared between multiple systems in an organization, domain or industry, which provides a standardized semantic to further categorize a data record.
Reference data is your status codes, product codes, flags and attributes, lookup tables, categories, and so on. From an end user perspective, it’s usually what you find in drop down fields.
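As a rough illustration of this definition, the sketch below models a reference data set as a list of permissible codes, each with its own definition, and uses it to check a record. The class names, the codes (`CORP`, `FOUND`, and so on), the placeholder definitions and the `validate_record` helper are all assumptions made for the example, not an established standard or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceValue:
    code: str         # the permissible value stored on records
    definition: str   # the distinct definition attached to that value


class ReferenceSet:
    """A named set of permissible values, each with its own definition."""

    def __init__(self, name, values):
        self.name = name
        self._by_code = {value.code: value for value in values}

    def is_permissible(self, code):
        return code in self._by_code

    def definition_of(self, code):
        return self._by_code[code].definition

    def choices(self):
        """Codes in a stable order, e.g. to populate a drop down field."""
        return sorted(self._by_code)


# Hypothetical codes and placeholder definitions for organization types.
ORGANIZATION_TYPE = ReferenceSet("organization_type", [
    ReferenceValue("CORP", "Corporation (placeholder definition)"),
    ReferenceValue("FOUND", "Foundation (placeholder definition)"),
    ReferenceValue("GOVCORP", "Government corporation (placeholder definition)"),
    ReferenceValue("NONPROF", "Non-profit organization (placeholder definition)"),
])


def validate_record(record, field, reference_set):
    """Return a list of error messages; an empty list means the field is valid."""
    code = record.get(field)
    if reference_set.is_permissible(code):
        return []
    return [f"{field}={code!r} is not a permissible value of {reference_set.name}"]


good = validate_record({"name": "Acme", "org_type": "CORP"}, "org_type", ORGANIZATION_TYPE)
bad = validate_record({"name": "Acme", "org_type": "XYZ"}, "org_type", ORGANIZATION_TYPE)
```

Here `good` comes back empty while `bad` carries one error message, which is the "standardized semantic" at work: a record can only be categorized with a value the reference set knows and defines.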
Once your reference data is understood, the conclusion is simple: organizations need to invest in a reference data management program to create operational efficiencies and aid the development of valuable information for all levels of the organization.
How do you define reference data? Please feel free to contribute your own definition or improve the one above. Meanwhile, read through the others I found.
Other definitions
- IBM:
Reference data refers to data that is used to categorize other data within enterprise applications and databases. – link to definition
- Simplicable:
Reference data is data that is used to structure and constrain other data. It is typically stable information with a known set of values that rarely change. – link to definition
- Danette McGilvray & Gwen Thomas:
Reference data are sets of values or classification schemas that are referred to by systems, applications, data stores, processes, and reports, as well as by transactional and master records. – link to definition