What is the difference between data classification and data categorization?

Data classification is often used as a synonym for data categorization. Are they the same? Not quite. The terms are often used interchangeably when in fact they shouldn't be, so let's clear up the confusion.

Is there a difference between data classification and data categorization in the information and data environment? Some say this: "There's no difference, it's the same thing. Data classification, data categorization. Potato, potahto."

Others think otherwise. What makes it even more confusing is that these terms are sometimes used interchangeably and sometimes reference each other. So are they different? According to this research paper, "Classification and Categorization: A Difference that Makes a Difference", they are.

Table of Contents

What is data classification?

What is data categorization?

Examples of data classification and data categorization

Relationship between data classification and data categorization

Conclusion

What is data classification?

Classification as a process involves the orderly and systematic assignment of each entity to one and only one class within a system of mutually exclusive and nonoverlapping classes.

The key takeaway is that each entity is assigned to one and only one class, and that each class is mutually exclusive from the other classes.

In data management, particularly within data privacy and security, data classification is used to tag structured and unstructured data, most often according to its sensitivity level, into mutually exclusive classes such as:

High sensitivity data
Medium sensitivity data
Low sensitivity data
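
The "one and only one class" requirement can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sensitivity` enum mirrors the three levels above, while the `classify` function, the flags it checks, and the sample records are all hypothetical and made up for this example.

```python
from enum import Enum


class Sensitivity(Enum):
    """The three mutually exclusive classes listed above."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def classify(record: dict) -> Sensitivity:
    """Return exactly one class for a record.

    The flags checked here are hypothetical. What matters is that
    every record ends up with a single class, never two at once.
    """
    if record.get("contains_government_id"):
        return Sensitivity.HIGH
    if record.get("contains_contact_details"):
        return Sensitivity.MEDIUM
    return Sensitivity.LOW


# Hypothetical sample records.
records = [
    {"name": "payroll export", "contains_government_id": True},
    {"name": "newsletter list", "contains_contact_details": True},
    {"name": "public job posting"},
]

labels = {r["name"]: classify(r) for r in records}
for name, label in labels.items():
    print(f"{name}: {label.name}")
```

Because `classify` returns a single enum member, a record cannot be both high and low sensitivity at the same time, which is the property that separates classification from categorization.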

What is data categorization?

Categorization is the process of dividing the world into groups of entities whose members are in some way similar to each other.

So data could also be categorized as high sensitivity data, medium sensitivity data and low sensitivity data. The difference is that the groups used in data categorization don't need to be mutually exclusive, whereas in data classification they have to be.


Examples of data classification and data categorization

1. Manufacturing example

Let's say that we need to organize a list of products that a company manufactures: bunk beds, adjustable beds, cradles, waterbeds, murphy beds, couches, canapés, Klippan, futons, etc.
Some of these can be categorized as beds, some as couches, and some, such as the futon, could go under either.

Under data classification, the futon would have to go either under the Bed class or under the Couch class, but not both.
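
The futon example can be sketched with plain Python sets. This is a rough illustration only: which product lands in which group is my assumption, placing the futon under "Bed" in the classification is an arbitrary choice, and the helper names `is_mutually_exclusive` and `groups_of` are hypothetical.

```python
# Categorization: groups may overlap, so a product can sit in several.
categories = {
    "Bed": {"bunk bed", "adjustable bed", "cradle", "waterbed",
            "murphy bed", "futon"},
    "Couch": {"couch", "canapé", "Klippan", "futon"},
}

# Classification: every product belongs to exactly one class.
# Putting the futon under "Bed" is an arbitrary choice for this sketch.
classes = {
    "Bed": {"bunk bed", "adjustable bed", "cradle", "waterbed",
            "murphy bed", "futon"},
    "Couch": {"couch", "canapé", "Klippan"},
}


def is_mutually_exclusive(groups: dict) -> bool:
    """Return True when no item appears in more than one group."""
    seen = set()
    for members in groups.values():
        if seen & members:  # an item was already placed in another group
            return False
        seen |= members
    return True


def groups_of(item: str, groups: dict) -> list:
    """Return the sorted names of all groups that contain the item."""
    return sorted(name for name, members in groups.items() if item in members)


print(is_mutually_exclusive(categories))  # False: the futon is in both groups
print(is_mutually_exclusive(classes))     # True
print(groups_of("futon", categories))     # ['Bed', 'Couch']
print(groups_of("futon", classes))        # ['Bed']
```

The same check could be run against any grouping to tell whether it qualifies as a classification or only as a categorization.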

2. Social Insurance Number example

A Social Insurance Number can be categorized under "Employee" data as well as under "Customer" data, depending on whom it belongs to. Then you can also have sub-categories such as:

Employee data
- Governmental Identification
  - Social Insurance Number
  - Passport Number
  - Permanent Resident Card Number
  - Military Identification Number
- Employee Unique Identifier
Customer data
- Governmental Identification
  - Social Insurance Number
  - Driver License Number
- Customer Unique Identifier

So it can belong to both of these categories. If we are classifying the Social Insurance Number, however, it can go under only one of the following:

High sensitivity data
Medium sensitivity data
Low sensitivity data

The Social Insurance Number would be considered high sensitivity data.
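
As a rough sketch, the category tree above can be written as nested Python dictionaries, with a small search that lists every place the Social Insurance Number appears. The names `category_tree`, `find_paths` and `sensitivity` are hypothetical and chosen only for this example.

```python
# The category tree from the example above. Leaves are empty dicts.
category_tree = {
    "Employee data": {
        "Governmental Identification": {
            "Social Insurance Number": {},
            "Passport Number": {},
            "Permanent Resident Card Number": {},
            "Military Identification Number": {},
        },
        "Employee Unique Identifier": {},
    },
    "Customer data": {
        "Governmental Identification": {
            "Social Insurance Number": {},
            "Driver License Number": {},
        },
        "Customer Unique Identifier": {},
    },
}


def find_paths(tree: dict, target: str, prefix: tuple = ()) -> list:
    """Return every path in the tree that ends at `target`."""
    paths = []
    for name, children in tree.items():
        path = prefix + (name,)
        if name == target:
            paths.append(path)
        paths.extend(find_paths(children, target, path))
    return paths


# Categorization: the same element shows up under more than one category.
sin_paths = find_paths(category_tree, "Social Insurance Number")
for path in sin_paths:
    print(" > ".join(path))

# Classification: a single-valued lookup, one class per element.
sensitivity = {"Social Insurance Number": "High sensitivity data"}
print(sensitivity["Social Insurance Number"])
```

The search returns two paths, one under employee data and one under customer data, while the sensitivity lookup returns a single value.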


Relationship between data classification and data categorization

Usually, a person or a piece of software first categorizes the data. Think of it as many different ways of slicing and dicing your data.


Once that's done, a different process kicks in and assigns the applicable sensitivity level based on some pre-determined rules.

For example, you could say that:

Regardless of the type of file, anything categorized as a health record will be classified as high sensitivity data
All job postings will be classified as low sensitivity data, but any job posting that includes data categorized as a passport number will go up to high sensitivity data, and so on

And this process could be manual and done by a human or automatic and done by a script or a program.
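
Here is a minimal sketch of what the automatic version could look like, using the two example rules above. The rule table, the `classify` function, and the choice to fall back to low sensitivity when no rule matches are all assumptions for illustration. A real policy may well prefer a stricter default.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical rule table: the sensitivity level each category implies.
CATEGORY_RULES = {
    "health record": Sensitivity.HIGH,
    "passport number": Sensitivity.HIGH,
    "job posting": Sensitivity.LOW,
}


def classify(categories: set, default: Sensitivity = Sensitivity.LOW) -> Sensitivity:
    """Assign one sensitivity level to an item based on its categories.

    An item can carry many categories, but it receives a single class:
    the highest level implied by any of them. Categories without a rule
    are ignored, and `default` is used when no rule matches at all.
    """
    levels = [CATEGORY_RULES[c] for c in categories if c in CATEGORY_RULES]
    return max(levels, default=default)


print(classify({"job posting"}).name)                     # LOW
print(classify({"job posting", "passport number"}).name)  # HIGH
print(classify({"health record", "pdf file"}).name)       # HIGH
```

Taking the maximum is what lets a job posting "go up" to high sensitivity once it also carries the passport number category, while still ending with exactly one class per item.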

Conclusion

I hope that this clarifies the relationship between data classification and data categorization and their differences. In the end, you'll encounter them both as synonyms and as different terms representing different processes. That's why I recommend finding out which definition is being used by the person you're talking to, the article you're reading, or the vendor pitching their solution.

How do you define data categorization and data classification? Do you differentiate between them?


About the author

George Firican

George Firican is the Director of Data Governance and Business Intelligence at the University of British Columbia, which is ranked among the top 20 public universities in the world. His passion for data led him towards award-winning program implementations in the data governance, data quality, and business intelligence fields. Due to his desire for continuous improvement and knowledge sharing, he founded LightsOnData, a website which offers free templates, definitions, best practices, articles and other useful resources to help with data governance and data management questions and challenges. He also has over twelve years of project management and business/technical analysis experience in the higher education, fundraising, software and web development, and e-commerce industries.

