It’s generally accepted that any organization working with data needs to adopt data governance. Readily available frameworks, tools, and services can be adapted to the requirements and environment of your organization, yet when it comes to big data governance, the options are a bit more complex.
Organizations must adapt to big data in four important areas:
1. Data quality
As with traditional data, establishing data quality metrics that are aligned with business objectives enables you to quickly uncover data quality issues and establish remediation plans. Accuracy, completeness, validity, consistency, and integrity still apply to big data, but there are additional data quality characteristics to consider:
- Timeliness: Does data arrive on time? Does it meet its refresh schedule? Does it meet the requirements for the time interval from collection to processing to analysis?
- Readability: Are the content and format easy to understand? Does the data need to be ready for human consumption in its initial state?
- Authorization: Does using the data require certain rights or permissions, and what limitations apply?
- Structure: Do you have the technology to transform unstructured data into structured data?
- Credibility: What is your confidence in this data?
This last point is particularly important. With any new big data set, you must step back and ask: “Given the context in which I want to use this data, what information about it do I require to have trust or confidence in this data?”
As an example, consider this external source: statistics about what people purchase at restaurants and the prices of menu items over the past five years. Who created the source data? What methodology did they follow in collecting the data? Were only certain cuisines or certain types of restaurants included? Can we identify how the information is organized and whether it correlates at any level with information already available elsewhere? Has the information been edited or modified by anyone else? Is there any other way to check the veracity of this information?
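Some of the characteristics listed earlier can be turned into simple metrics. As a minimal sketch of the timeliness questions, the following Python computes the share of records processed within an agreed interval after collection. The field names `collected_at` and `processed_at`, the function name, and the one-hour threshold are all assumptions made for illustration, not part of any particular tool.

```python
from datetime import datetime, timedelta


def timeliness_report(records, max_lag):
    """Count how many records were processed within max_lag of collection.

    Each record is assumed (hypothetically) to be a dict with
    'collected_at' and 'processed_at' datetime values.
    """
    if not records:
        return {"total": 0, "on_time": 0, "on_time_ratio": None}
    on_time = sum(
        1 for r in records
        if r["processed_at"] - r["collected_at"] <= max_lag
    )
    return {
        "total": len(records),
        "on_time": on_time,
        "on_time_ratio": on_time / len(records),
    }


# Illustrative data: one record processed in 30 minutes, one in 3 hours.
sample_records = [
    {"collected_at": datetime(2024, 1, 1, 8, 0),
     "processed_at": datetime(2024, 1, 1, 8, 30)},
    {"collected_at": datetime(2024, 1, 1, 9, 0),
     "processed_at": datetime(2024, 1, 1, 12, 0)},
]

report = timeliness_report(sample_records, max_lag=timedelta(hours=1))
print(report)
```

A ratio like this only becomes a governance metric once the acceptable lag is agreed with the business owners of the data; the threshold itself is a policy decision, not a technical one.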
2. Metadata management
Ideally, before starting to access big data, ensure your reference information architecture is updated to support big data concepts such as unstructured data streams. Taking a call center’s data as an example, there is useful metadata assigned to the call itself, such as the country of the caller. Different software codes that fact in different ways: as the full name of the country, or as an ISO 3166-1 two-letter (alpha-2) or three-letter (alpha-3) code (for a downloadable code list, please see the article “The single best strategy for improving your mailing addresses”). Whichever form it takes, you need to ensure this new information is mapped to your organization’s established reference data.
Metadata management capabilities need to be enhanced to encompass relationships between data, people, processes, and data use. To ensure continuity, metadata management also needs to be paired with, and promoted through, education and training programs.
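One way to picture the mapping step is a small lookup that accepts a country name, a two-letter code, or a three-letter code and returns a single canonical value. This is only a sketch: the `REFERENCE_COUNTRIES` table and the function names are hypothetical, the table holds just three entries for illustration, and it assumes the two-letter code is your organization's canonical form.

```python
# Hypothetical reference data: canonical two-letter code mapped to the
# three-letter code and country name. A real table would come from your
# organization's reference data store.
REFERENCE_COUNTRIES = {
    "US": ("USA", "United States of America"),
    "DE": ("DEU", "Germany"),
    "FR": ("FRA", "France"),
}


def build_lookup(reference):
    """Index every known spelling (case-insensitively) to the canonical code."""
    lookup = {}
    for alpha2, (alpha3, name) in reference.items():
        for key in (alpha2, alpha3, name):
            lookup[key.casefold()] = alpha2
    return lookup


_LOOKUP = build_lookup(REFERENCE_COUNTRIES)


def to_reference_code(raw_value):
    """Return the canonical two-letter code, or None if the value is unmapped."""
    if raw_value is None:
        return None
    return _LOOKUP.get(raw_value.strip().casefold())


print(to_reference_code("deu"))        # three-letter code
print(to_reference_code(" Germany "))  # full name with stray whitespace
print(to_reference_code("Atlantis"))   # unknown value
```

Returning `None` for unmapped values, rather than guessing, leaves a clear signal that a steward needs to review the value or extend the reference data.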
3. Data stewardship
The complexity of big data is also reflected in its stewardship. Data roles such as data steward and data owner are not as clear with large data sets. For example, what department is responsible for clickstream data? Is it marketing — because that data tracks the engagement and reach of potential customers and marketing efforts? Is it finance — because they need to calculate the return on investment? Is it IT — because that department manages the infrastructure and may be responsible for ensuring the proper APIs and tools collect the data?
It’s not advisable to have multiple “owners” responsible for the same data, and with big data, roles may change as that data moves through your ecosystem as well as through its life cycle. Nonetheless, these roles should be well understood.
Organizations should:
- Identify stakeholders as soon as possible, but be prepared to refine and iterate as you go
- Establish timelines and regular checkpoints; begin to measure the area being governed with key milestones
- Assign clear accountability to ensure progress is made
- Ensure clear measurements are employed
4. Data retention
If you have not already done so, define how long your data is considered current and relevant, then archive everything outside that range. Consider this statistic from the 2016 Veritas Global Databerg report: 85 percent of the data an average organization stores is either “dark” (of unknown value) or redundant, obsolete, or trivial. Storage may be cheap per unit, but at big data volumes the total cost rises considerably. Organizations spend millions of dollars a year storing data they’ll never use. This is not just a failure of good business sense; it is a failure of data governance.
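A retention rule of this kind can be expressed in a few lines. The sketch below splits records into those still inside the retention window and those due for archiving. The `last_used` field, the function name, and the 365-day window are assumptions for illustration; the actual window should come from your retention policy.

```python
from datetime import date, timedelta


def split_by_retention(records, today, retention_days):
    """Separate records inside the retention window from those to archive.

    Each record is assumed (hypothetically) to be a dict with a
    'last_used' date. Records last used before the cutoff are archived.
    """
    cutoff = today - timedelta(days=retention_days)
    current = [r for r in records if r["last_used"] >= cutoff]
    to_archive = [r for r in records if r["last_used"] < cutoff]
    return current, to_archive


# Illustrative data only.
datasets = [
    {"name": "clickstream_2024_05", "last_used": date(2024, 5, 1)},
    {"name": "campaign_export_2021", "last_used": date(2021, 1, 15)},
]

current, to_archive = split_by_retention(
    datasets, today=date(2024, 6, 1), retention_days=365
)
print([r["name"] for r in current])
print([r["name"] for r in to_archive])
```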
Note: Article originally published in TDWI Upside.