What is a data catalog?
The easiest way to understand what a data catalog is, is to first understand its purpose. For that, let’s take a trip down memory lane and reminisce about the days of product catalogs. Does anyone remember the old Macy’s catalogs? You would browse through one for clothes, cooking items, food, and other goods, then place an order, and a few weeks later the item would arrive at your house, the post office, or a general store. Wholesale stores like Costco, furniture stores like Ikea, and retailers like Target still mail their own version of a catalog to current or prospective customers.
Then there are the updated versions of the catalog. Let’s call it catalog 2.0! Probably the best example, at least at the moment, is Amazon.com. Amazon carries millions of different products and yet, as consumers, we can find almost anything fairly quickly. For example, I searched for books on moving abroad and within a few seconds identified the item I was looking for, in this case the book “Moving Abroad – the Essentials”. Beyond its advanced search capabilities, Amazon also gives detailed information on each product, such as:
- A detailed description of the product
- A summary of reviews compiled from different buyers, itemized if need be
- Various versions and formats of the product
- A list of related products
- Various details pertinent to the type of product you’re viewing, in this case: the publisher, number of pages, language, product dimensions and weight, etc.
- Instructional videos when applicable, and the seller’s information
- Cost, availability, shipping costs and estimates
Data catalogs work in the same way for your databases, but they can also be used for data lakes, data warehouses, and any other data store. In short, they allow you to find the data you need and provide useful metadata about it.
What is a data catalog?
So, a data catalog is:
An enterprise-wide asset providing a single reference source for the location of any data set required for various needs.
These needs can fall under the categories of Operational, Business Intelligence, Analytics, Data Science, etc. A data catalog:
- is an enterprise-wide inventory or directory of data sets
- helps organize the thousands or millions of an organization’s data sets so that users can search for specific data and understand its metadata, such as data lineage, uses, and even how others perceive the data’s value
- offers the end user the ability to locate information and further provides the mapping between the business glossary and data dictionaries
Yes, it does tie into a business glossary and a data dictionary, as its business and technical metadata are referenced from those two other artifacts.
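To make the definition a bit more tangible, here is a minimal sketch of a catalog as an in-memory inventory that you can register data sets in and search. Everything in it is an assumption made for illustration: the class names `CatalogEntry` and `DataCatalog`, the field names, and the sample data sets are hypothetical, and a real catalog product would persist and index this information very differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data set in the catalog (all field names are illustrative)."""
    name: str
    location: str  # where the data lives, e.g. a table or a folder
    description: str = ""
    # Business metadata: terms that would come from a business glossary.
    glossary_terms: list[str] = field(default_factory=list)
    # Technical metadata: column name -> type, as a data dictionary would hold.
    columns: dict[str, str] = field(default_factory=dict)


class DataCatalog:
    """A toy in-memory inventory of data sets."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[CatalogEntry]:
        """Return entries whose name, description, or glossary terms mention the keyword."""
        needle = keyword.lower()
        return [
            e for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in term.lower() for term in e.glossary_terms)
        ]


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="sales_orders",
    location="warehouse.sales.orders",
    description="One row per customer order",
    glossary_terms=["Order", "Revenue"],
    columns={"order_id": "INTEGER", "total": "DECIMAL"},
))
catalog.register(CatalogEntry(
    name="web_clicks",
    location="lake/raw/clicks/",
    description="Raw clickstream events",
    glossary_terms=["Session"],
))

# A business user searches by a glossary term and gets back the location.
matches = catalog.search("revenue")
print([e.location for e in matches])
```

The point of the sketch is the mapping: a search on a business term ("revenue") leads to the technical location and columns of the data set that carries it.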
Example of a data catalog
If you are looking for concrete examples, let’s look at some. I think the best examples come from open data catalogs, such as those offered by municipalities or governments. Note that not all data catalogs are created equal: they don’t all have the same features, types of metadata, or user interface. Here is a high-level overview of catalog.data.gov.
As we can see from Figure 2, the US Government’s open data catalog hosts over 230,000 data sets across 14 different topics. As with Amazon, one can search or browse for a particular data set, or just start exploring. Going into one of the topics reveals another similarity to Amazon, as we can see below in Figure 3.
One can search, sort, or filter by different topics, categories, and tags. Each data set in the results shows a description, the available formats, and the number of times it has been accessed. In the case of US Government data, this last metric is what lets a user gauge the popularity of a data set.
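The browsing behavior described above boils down to filtering and sorting a list of results. Here is a small sketch of that idea. The data set titles, tags, formats, and view counts are made up, and the `browse` function is a hypothetical helper, not how catalog.data.gov is actually implemented.

```python
# Hypothetical search results; titles, tags, and view counts are invented.
datasets = [
    {"title": "Air Quality Readings", "tags": {"environment", "health"},
     "formats": {"CSV", "JSON"}, "views": 1200},
    {"title": "Bridge Inspections", "tags": {"transportation"},
     "formats": {"CSV"}, "views": 300},
    {"title": "Water Quality Samples", "tags": {"environment"},
     "formats": {"XML"}, "views": 4500},
]


def browse(datasets, tag=None, fmt=None):
    """Filter by tag and/or format, then sort by view count, most viewed first."""
    results = [
        d for d in datasets
        if (tag is None or tag in d["tags"])
        and (fmt is None or fmt in d["formats"])
    ]
    return sorted(results, key=lambda d: d["views"], reverse=True)


popular_environment = [d["title"] for d in browse(datasets, tag="environment")]
print(popular_environment)

environment_csv = [d["title"] for d in browse(datasets, tag="environment", fmt="CSV")]
print(environment_csv)
```

Sorting on the access count is what turns that number into a popularity ranking.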
In Figure 4, let’s go into further detail of a particular data set.
What I like about this is that you can quickly determine:
- who the data steward is and how to contact them
- who the owner is (or the publisher in this case)
- when the data was created, last updated, and its update frequency
- its access and usage rights
- how to download the data that are part of this data set
- how to access other data sets that have the same tag(s)
Additionally, there is plenty of metadata that one can access, such as:
- its unique identifier
- details on data schema – if you access that link you will basically see a version of a data dictionary filled with technical metadata for the fields available in the data set. Note: there is also a reference to a data dictionary, but in this case it just links to the sources of the data set.
- licensing details and many other things
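The details listed above can be thought of as one metadata record per data set. As a rough sketch, here is what such a record could look like, along with two small helpers that use it. The `DatasetRecord` class, its field names, and the sample values are all assumptions of mine for illustration; they are not the schema that catalog.data.gov uses.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class DatasetRecord:
    """Hypothetical metadata record for one data set."""
    identifier: str
    title: str
    steward_email: str       # who to contact about the data
    publisher: str           # the owner
    created: date
    last_updated: date
    update_every_days: int   # expected update frequency
    license: str
    tags: set = field(default_factory=set)
    schema: dict = field(default_factory=dict)  # field name -> type

    def is_overdue(self, today: date) -> bool:
        """True if the data set has missed its expected update window."""
        return today > self.last_updated + timedelta(days=self.update_every_days)


def related(record, others):
    """Other data sets that share at least one tag with the given record."""
    return [o for o in others
            if o.identifier != record.identifier and o.tags & record.tags]


bus_routes = DatasetRecord(
    "ds-001", "Bus Routes", "steward@example.org", "Transit Agency",
    created=date(2020, 1, 1), last_updated=date(2024, 1, 1),
    update_every_days=30, license="CC0",
    tags={"transit", "schedules"}, schema={"route_id": "text"},
)
bus_stops = DatasetRecord(
    "ds-002", "Bus Stops", "steward@example.org", "Transit Agency",
    created=date(2021, 6, 1), last_updated=date(2024, 2, 1),
    update_every_days=90, license="CC0", tags={"transit"},
)
park_trees = DatasetRecord(
    "ds-003", "Park Trees", "parks@example.org", "Parks Department",
    created=date(2019, 3, 1), last_updated=date(2023, 3, 1),
    update_every_days=365, license="CC0", tags={"parks"},
)

all_records = [bus_routes, bus_stops, park_trees]
print([r.title for r in related(bus_routes, all_records)])
print(bus_routes.is_overdue(date(2024, 3, 1)))
```

Keeping the last update date next to the update frequency is what lets a reader (or a script) judge whether the data is still fresh.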
Again, not all data catalogs are the same, and there are advantages and disadvantages to each of them. I encourage you to look into others, such as Open Data BC, Open Data UK, and Open Data Vancouver (one of my favorites, as it also includes a built-in analysis tool that lets you perform simple analyses of a particular data set right from the browser).