Data classification is often used as a synonym for data categorization. Are they the same? Not quite. The terms are often used interchangeably when they shouldn't be, so let's clear up the confusion.
Is there a difference between data classification and data categorization in the information and data environment? Some say this: "There's no difference, it's the same thing. Data classification, data categorization. Potato, potahto."
Others think otherwise. What makes it even more confusing is that these terms are sometimes used interchangeably and sometimes reference each other. So are they different? According to this research paper, "Classification and Categorization: A Difference that Makes a Difference", they are.
What is data classification?
Classification as a process involves the orderly and systematic assignment of each entity to one and only one class within a system of mutually exclusive and nonoverlapping classes.
The key takeaway is that each entity is assigned to one and only one class, and that class is mutually exclusive from all the others.
In data management, particularly within data privacy and security, data classification is used to tag structured and unstructured data, most often according to its sensitivity level, into mutually exclusive classes such as:
- High sensitivity data
- Medium sensitivity data
- Low sensitivity data
What is data categorization?
Categorization is the process of dividing the world into groups of entities whose members are in some way similar to each other.
So data could also be categorized as high sensitivity data, medium sensitivity data and low sensitivity data. The difference is that the groups used in data categorization don't need to be mutually exclusive, whereas the classes in data classification must be.
Examples of data classification and data categorization
1. Manufacturing example
Let's say that we need to organize a list of products that a company manufactures.
Suppose they produce bunk beds, adjustable beds, cradles, waterbeds, Murphy beds, couches, canapés, Klippan, futons, etc.
Some of these can be categorized as beds, some as couches, and some, such as the futon, could go under either.
Under data classification, the futon would have to go under either Bed or Couch, but not both.
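The futon example can be sketched in a few lines of Python. This is only an illustration: the product-to-category assignments and the tie-break rule (`tie_break`) are assumptions made for the sketch, not something a real taxonomy would necessarily use.

```python
# Categorization: each product maps to a SET of groups, so overlap is allowed.
# These assignments are illustrative assumptions.
categories = {
    "bunk bed": {"Bed"},
    "waterbed": {"Bed"},
    "couch": {"Couch"},
    "futon": {"Bed", "Couch"},  # belongs to both groups
}


def classify(product: str, tie_break: str = "Bed") -> str:
    """Classification: return exactly one class per product."""
    groups = categories[product]
    if len(groups) == 1:
        return next(iter(groups))
    # The product sits in several categories, so a rule has to pick one class.
    # Preferring `tie_break` is a hypothetical rule chosen for this sketch.
    return tie_break if tie_break in groups else sorted(groups)[0]


classes = {product: classify(product) for product in categories}
print(categories["futon"])  # both groups
print(classes["futon"])     # Bed
```

The categorization keeps both groups for the futon, while the classification is forced to settle on a single one.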
2. Social Insurance Number example
A Social Insurance Number can be categorized under "Employee" data as well as under "Customer" data, depending on whom it belongs to. You can also have sub-categories such as:
- Employee data
  - Governmental Identification
    - Social Insurance Number
    - Passport Number
    - Permanent Resident Card Number
    - Military Identification Number
  - Employee Unique Identifier
- Customer data
  - Governmental Identification
    - Social Insurance Number
    - Driver License Number
  - Customer Unique Identifier
So it can belong in both of these categories. If we are classifying the Social Insurance Number, it can only go under one of the following:
- High sensitivity data
- Medium sensitivity data
- Low sensitivity data
The Social Insurance Number would be considered high sensitivity data.
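As a rough sketch of the same idea in Python: a data element can sit under several category paths, but it maps to exactly one sensitivity class. The names `CATEGORY_PATHS`, `SENSITIVITY` and `top_level_categories` are hypothetical and exist only for this illustration. Only the Social Insurance Number is given a class here, because it is the only one stated in this example.

```python
from enum import Enum


class Sensitivity(Enum):
    HIGH = "High sensitivity data"
    MEDIUM = "Medium sensitivity data"
    LOW = "Low sensitivity data"


# Categorization: one element may appear under several category paths.
CATEGORY_PATHS = {
    "Social Insurance Number": [
        ("Employee data", "Governmental Identification"),
        ("Customer data", "Governmental Identification"),
    ],
    "Passport Number": [("Employee data", "Governmental Identification")],
    "Driver License Number": [("Customer data", "Governmental Identification")],
}

# Classification: each classified element maps to exactly one class.
SENSITIVITY = {"Social Insurance Number": Sensitivity.HIGH}


def top_level_categories(element: str) -> set:
    """Return the top-level categories an element is filed under."""
    return {path[0] for path in CATEGORY_PATHS[element]}


print(sorted(top_level_categories("Social Insurance Number")))
print(SENSITIVITY["Social Insurance Number"].value)
```

A dictionary with a single value per key cannot hold two classes for the same element, which mirrors the "one and only one class" requirement.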
Relationship between data classification and data categorization
Usually, a human or a piece of software first categorizes the data. Think of it as many different ways of slicing and dicing your data.
Once that's done, a different process kicks in and assigns the applicable sensitivity level based on some pre-determined rules.
For example, you could say that:
- Regardless of the type of file, anything categorized as a health record is classified as high sensitivity
- All job postings are classified as low sensitivity data, but any job posting that includes data categorized as a passport number goes up to high sensitivity, and so on
And this process could be manual and done by a human or automatic and done by a script or a program.
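Here is a minimal sketch of the automatic variant in Python, using the two example rules above. The function name `classify_sensitivity`, the category labels, and the `default` level for data that matches no rule are all assumptions made for illustration. When several rules match, the sketch keeps the highest level.

```python
from typing import Set

# Ordering used to pick the highest matching level.
ORDER = {"low": 0, "medium": 1, "high": 2}


def classify_sensitivity(categories: Set[str], default: str = "medium") -> str:
    """Assign one sensitivity level from the categories found on a file.

    `default` is a hypothetical fallback for files that match no rule.
    """
    matched = []
    # Rule 1: health records are high sensitivity, whatever the file type.
    if "health record" in categories:
        matched.append("high")
    # Rule 2: job postings are low sensitivity...
    if "job posting" in categories:
        matched.append("low")
        # ...unless they contain data categorized as a passport number.
        if "passport number" in categories:
            matched.append("high")
    if not matched:
        return default
    # Several rules may match; the highest level wins.
    return max(matched, key=ORDER.__getitem__)


print(classify_sensitivity({"job posting"}))                     # low
print(classify_sensitivity({"job posting", "passport number"}))  # high
print(classify_sensitivity({"health record", "pdf"}))            # high
```

The input is a set of categories, so overlap is fine there, while the output is always a single level.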
Conclusion
I hope this clarifies the relationship between data classification and data categorization, and how they differ. In practice you'll encounter them both as synonyms and as distinct terms for different processes. That's why I recommend getting the definition from the person you're talking to, from the article you're reading, or from the vendor pitching their solution.
How do you define data categorization and data classification? Do you differentiate between them?