5 reasons why we have dark data

Dark data is probably here to stay, at least for the next few years. Most organizations have it, if not all. But why is that? Why is dark data so prevalent? Here are the 5 reasons why we have dark data.

Table Of Contents

1. Different priorities

2. Lack of data governance

3. Poor data quality

4. Constraints from tools and skills

5. Inexpensive storage

Conclusion

1. Different priorities

The first reason organizations may have dark data is that they focus on different priorities, and sometimes these priorities are a little bit lopsided.

What do I mean? Let me give you an example. Let's say there's a bank analyzing online applications for credit cards and the credit card marketing team is focused solely on customer details and eligibility. Sure, that's fair, but at the same time, no attention is paid to the data on how the customer arrived at the application page. This unattended data could have provided valuable insights into the usability of the bank website and, in particular, the application page. But unfortunately there's no priority assigned to this aspect.

A lot of organizations are not investing in web analytics even though they are capturing and storing that data. Because they're not using it at all, it just becomes dark data.

2. Lack of data governance

When you don't have data governance, there's a higher chance of your organization operating in silos. Of course, that's not the only downside created by a lack of data governance, but it's definitely one with a direct impact on the creation of dark data. Here's why.

In a lot of organizations, departments have their own data collection and storage processes which may not be known to other departments. So this data might be collected and remain unused even if it is relevant to other departments. They could use it, but they don't have a process to even find out about it, let alone use it. Frustrating, isn't it?

3. Poor data quality

Poor data quality creates havoc in any organization, sometimes unbeknownst to the organization. If the data collected is incomplete or potentially inaccurate, or if there is a lack of trust in it due to poor data quality and governance, or even a lack of clarity about how it was collected, there is a high chance that this data will not be used.

Even if you have important customer information about a transaction, it may be missing location or other important metadata because that information sits somewhere else, or was not captured in a usable format, or wasn't captured properly or in its entirety. Most likely this data will not be used because it's poor quality data from the start.

I'll give you another example. Let's say we have the audio recordings from a call center and the AI producing the transcripts is not providing good results. Maybe that's because of the quality of the call, or the algorithm itself, or the lack of enough proper data.

Well, you're not going to feed that incomplete transcript, that incomplete data, to another process to analyze it, maybe to understand the sentiment of the customer on the other end of the line. So that becomes dark data. Yes, that frustration carries on here, too.
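To make the transaction example a bit more concrete, here is a minimal sketch of a completeness check that separates usable records from the ones likely to end up as dark data. The field names (customer_id, amount, location) and the sample records are hypothetical, made up purely for illustration, and a real data quality check would cover much more than missing values.

```python
# Hypothetical required fields, chosen for illustration only.
REQUIRED_FIELDS = ("customer_id", "amount", "location")


def split_by_completeness(records, required=REQUIRED_FIELDS):
    """Separate records that have every required field from those that do not."""
    usable, at_risk = [], []
    for record in records:
        # A field counts as missing when it is absent, None, or an empty string.
        missing = [f for f in required if record.get(f) in (None, "")]
        if missing:
            at_risk.append((record, missing))
        else:
            usable.append(record)
    return usable, at_risk


# Made-up sample transactions: one complete, two missing their location.
transactions = [
    {"customer_id": "C1", "amount": 120.0, "location": "Vancouver"},
    {"customer_id": "C2", "amount": 75.5, "location": ""},
    {"customer_id": "C3", "amount": 10.0},
]

usable, at_risk = split_by_completeness(transactions)
for record, missing in at_risk:
    print(record["customer_id"], "is missing:", ", ".join(missing))
```

Knowing which fields are missing, and how often, at least tells you where the quality problem starts, which is a first step towards fixing it instead of quietly ignoring the data.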

4. Constraints from tools and skills

If data collection is done by separate technologies and tools in the same organization, there may be cases where these technologies and tools do not interact with each other because of technological constraints. For example, it may be difficult to integrate audio file contents from the call center mentioned above with click data from that organization's website.

Not having a tool to analyze dark data is listed as the number one reason why companies aren't using dark data, according to Datumize.

Or there might be a lack of knowledge and skills on how to integrate this data, how to analyze it, how to derive meaning from it. It could be that they only know how to use structured data. Why would the data get collected in the first place, then? Well, it usually gets collected because other organizations are doing it and the organization wants to capture it for future use, when it will have the capabilities to use it.

The larger the dataset, or better said, the less structured it is, the more sophisticated the tool required for analysis. Additionally, these kinds of datasets often require analysis by individuals with significant data science expertise, who are often in short supply, believe it or not. Organizations that are at the early stages of a data analytics or data science program face these problems.

5. Inexpensive storage

The last and maybe the most obvious reason why organizations have dark data is that, in the grand scheme of things, it is inexpensive to store it.

Maybe that's why we have so many photos on our phones. I remember a time when we used to have these film cameras and my parents and I used to go on trips and they would just bring a couple of rolls of film and take photos. We would maybe take 24 or 48 photographs during the entire trip. Now with our phones, we're taking hundreds of photos just from the one place, not even talking about the whole trip. By the way, have you ever used one of these film cameras?

Going back to our topic, let's take an organization's intranet as an example. If that intranet reaches its storage limit, the IT department is more likely to pay for a few more gigabytes of storage to be added. The alternative would be to scour the existing files, documents, data downloads, and whatever else that intranet might contain in order to identify what's obsolete and not considered useful anymore, and purge it. In a time when storage is considered cheap, it can be more challenging to do all of that than to pay a little bit more for some extra storage.
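The "scour the existing files" alternative doesn't have to be fully manual. Here is a minimal sketch of how a first pass could look, assuming that a file's last modification time is a reasonable proxy for whether it is still in use. That is my assumption for the example, not a rule: what counts as obsolete should come from your retention policy. The function name and the one-year threshold are hypothetical.

```python
import time
from pathlib import Path


def find_stale_files(root, max_age_days=365, now=None):
    """Return (path, size_in_bytes) for files not modified in max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 60 * 60
    stale = []
    # Walk the whole folder tree under root.
    for path in Path(root).rglob("*"):
        if path.is_file():
            info = path.stat()
            # Assumption: an old modification time suggests the file is unused.
            if info.st_mtime < cutoff:
                stale.append((path, info.st_size))
    # Largest files first, since they free the most space if purged.
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

Calling find_stale_files on a shared folder gives you a list of candidates to review, largest first. It only reports files and deletes nothing, so somebody who knows the data still makes the final call.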

Conclusion

As long as these reasons exist, dark data is here to stay. But knowing the reasons listed above and addressing them will help organizations reduce the amount of dark data that they hold. Well, I hope inexpensive storage is here to stay. 🙂

About the author

George Firican

George Firican is the Director of Data Governance and Business Intelligence at the University of British Columbia, which is ranked among the top 20 public universities in the world. His passion for data led him towards award-winning program implementations in the data governance, data quality, and business intelligence fields. Due to his desire for continuous improvement and knowledge sharing, he founded LightsOnData, a website which offers free templates, definitions, best practices, articles and other useful resources to help with data governance and data management questions and challenges. He also has over twelve years of project management and business/technical analysis experience in the higher education, fundraising, software and web development, and e-commerce industries.
