What is a data catalog?

February 5, 2020


The easiest way to understand what a data catalog is, is to first understand its purpose. For that, let’s take a trip down memory lane and reminisce about the days of product catalogs. Does anyone remember the old Macy’s catalogs? You would browse through one for clothes, cooking items, food, and other stuff, then put in your order, and a few weeks later the item would arrive at your house, the post office, or a general store. Wholesale stores like Costco, furniture stores like Ikea, and retail stores like Target still mail their own version of a catalog to their current or prospective customers.

Then again, we also have updated versions of this catalog. Let’s call it catalog 2.0! Probably the best example of this improved catalog, at least at the moment, is Amazon.com. Amazon carries millions of different products and yet, as consumers, we can find almost anything fairly quickly. For example, I was searching for books on moving abroad, and within a few seconds I found the item I was looking for: the book “Moving Abroad – the Essentials”. Beyond its advanced search capabilities, Amazon also gives detailed information on each product, such as:

Figure 1: Product Page Example in Amazon
  • A detailed description of the product
  • A summary of reviews compiled from different buyers, itemized if need be
  • Various versions and formats of the product
  • A list of related products
  • Various details pertinent to the type of product you’re viewing, in this case: the publisher, number of pages, language, product dimensions and weight, etc.
  • Instructional videos, when applicable
  • The seller’s information
  • Cost, availability, shipping costs and estimates

Data catalogs work the same way for your databases, and they can also be used for data lakes, data warehouses, or any other data store. In short, they let you find the data you need and give you useful metadata about it.

What is a data catalog?

So, a data catalog is:

An enterprise-wide asset providing a single reference source for the location of any data set required for various needs.

These needs can fall under the categories of Operational, Business Intelligence, Analytics, Data Science, etc. A data catalog:

  • is an enterprise-wide inventory or directory of data sets
  • helps organize an organization’s thousands or millions of data sets so that users can search for specific data and understand its metadata, such as data lineage, uses, and even how others perceive the data’s value
  • lets the end user locate information and provides the mapping between the business glossary and data dictionaries

Yes, it does tie into a business glossary and a data dictionary, as its business and technical metadata are referenced from those two artifacts.
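To make the idea of an inventory a bit more tangible, here is a minimal Python sketch of a catalog as a searchable list of entries. Everything in it is an assumption made for illustration: the `CatalogEntry` and `DataCatalog` names, the fields, and the two sample data sets are hypothetical and do not come from any real catalog product. The `glossary_terms` field stands in for business metadata and `columns` for the technical metadata a data dictionary would hold.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data set in the catalog (all fields are hypothetical)."""
    name: str
    location: str       # where the data set lives, e.g. a table or a file path
    description: str
    glossary_terms: list[str] = field(default_factory=list)  # business metadata
    columns: dict[str, str] = field(default_factory=dict)    # technical metadata: column -> type


class DataCatalog:
    """A toy in-memory inventory of data sets."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        """Add a data set to the inventory, keyed by its name."""
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[CatalogEntry]:
        """Return entries whose name, description, or glossary terms mention the keyword."""
        needle = keyword.lower()
        return [
            e for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in term.lower() for term in e.glossary_terms)
        ]


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="donations_2019",
    location="warehouse.fundraising.donations_2019",
    description="Gifts received during the 2019 fiscal year",
    glossary_terms=["Donor", "Gift"],
    columns={"donor_id": "INTEGER", "amount": "DECIMAL(10,2)"},
))
catalog.register(CatalogEntry(
    name="course_enrolment",
    location="lake/raw/enrolment.parquet",
    description="Student enrolment per course and term",
    glossary_terms=["Student", "Course"],
    columns={"student_id": "INTEGER", "course_code": "TEXT"},
))

# Searching by a business term leads the user to where the data lives.
matches = catalog.search("donor")
for entry in matches:
    print(entry.name, "->", entry.location)
```

The search works on a business term ("donor") rather than a table name, which is the kind of glossary-to-data mapping described above. A real catalog would of course sit on top of a proper metadata store rather than a Python dictionary.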

Example of a data catalog

If you are looking for concrete examples, let’s look at some. I think that the best examples come from open data catalogs, such as those offered by municipalities or governments. Keep in mind that not all data catalogs are created equal. They don’t all have the same features, types of metadata, or user interface, but here is a high-level overview of catalog.data.gov:

Figure 2: US Government Open Data Catalog

As we can see from Figure 2, the US Government’s open data catalog hosts over 230,000 data sets across 14 different topics. As on Amazon, one can search or browse for a specific data set or just start exploring. Going into one of the topics shows another similarity with Amazon, as we can see below in Figure 3.

Figure 3: Data Catalog Example by Topic

One can search, sort, or filter by different topics, categories, and tags. Each individual data set appearing in the results quickly provides a description, the available formats, and the number of times it has been accessed. In the case of US Government data, this last metric is what lets a user gauge the popularity of a data set.
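The filter-and-sort behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Listing` class, the `browse` function, and the sample listings with their access counts are all made up, and this is not how catalog.data.gov is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """A search result as shown in a catalog (hypothetical fields and values)."""
    title: str
    formats: tuple[str, ...]
    tags: frozenset[str]
    access_count: int   # how many times the data set has been accessed


def browse(listings, tag=None, file_format=None):
    """Filter by tag and/or format, then sort with the most accessed first."""
    results = [
        item for item in listings
        if (tag is None or tag in item.tags)
        and (file_format is None or file_format in item.formats)
    ]
    return sorted(results, key=lambda item: item.access_count, reverse=True)


# Made-up sample data, for illustration only.
listings = [
    Listing("Example rainfall totals", ("CSV", "JSON"), frozenset({"climate", "weather"}), 120),
    Listing("Example crop yields", ("CSV",), frozenset({"agriculture"}), 340),
    Listing("Example sea level readings", ("JSON",), frozenset({"climate", "ocean"}), 560),
]

for item in browse(listings, tag="climate"):
    print(item.title, item.access_count)
```

Sorting by access count puts the most popular data sets at the top of the results, much like sorting products by number of reviews.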

Figure 4 goes into further detail on one particular data set.

Figure 4: Detailed Example of a Data Set

What I like about this is that you can quickly determine:

  • who the data steward is and how to contact them
  • who the owner is (or the publisher in this case)
  • when the data was created, last updated, and its update frequency
  • its access and usage rights
  • how to download the data that are part of this data set
  • how to access other data sets that have the same tag(s)
Figure 5: Metadata Details

Additionally, there is plenty of metadata that one can access, such as:

  • its unique identifier
  • details on the data schema: if you follow that link, you will see what is basically a version of a data dictionary, filled with technical metadata for the fields available in the data set. Note: there is also a reference to a data dictionary, but in this case it just links to the sources of the data set.
  • licensing details and many other things

Conclusion

Again, not all data catalogs are the same, and there are advantages and disadvantages to each of them. I encourage you to look into others, such as Open Data BC, Open Data UK, and Open Data Vancouver (one of my favorites, as it also includes a built-in analysis tool that lets you explore a particular data set and draw simple insights from it right in the browser).

 

 


About the author 

George Firican

George Firican is the Director of Data Governance and Business Intelligence at the University of British Columbia, which is ranked among the top 20 public universities in the world. His passion for data led him towards award-winning program implementations in the data governance, data quality, and business intelligence fields. Due to his desire for continuous improvement and knowledge sharing, he founded LightsOnData, a website which offers free templates, definitions, best practices, articles and other useful resources to help with data governance and data management questions and challenges. He also has over twelve years of project management and business/technical analysis experience in the higher education, fundraising, software and web development, and e-commerce industries.
