When talking about Big Data, the conversation tends to focus on Data Science and analytics. That is, the stories about Big Data that hit the front pages of the mainstream press and the hallway conversations taking place at events like Strata are mostly about all the cool new ways to use data to greater effect.
But Big Data Analytics doesn’t take place in a vacuum. It takes place in the enterprise. And any time you mix data and the enterprise, you can’t afford to ignore data management best practices. It may not be as sexy as predictive analytics, but failure to apply fundamental data management best practices to Big Data projects can lead not just to failed projects, but to potential legal consequences as well.
Specifically, Big Data practitioners must consider:
- Data Quality – How accurate, complete and reliable is the data in question? In traditional data management scenarios, this meant ensuring customer names and addresses were accurate and up-to-date, for example. In Big Data scenarios, things grow more complex. Does this Twitter handle for @JohnRSmith refer to my customer John R. Smith? Do these IP addresses and mobile app log data correlate to the user or users I think they do?
- Data Governance – What are acceptable uses of data? Who is authorized to analyze particular data sets? When should data be disposed of? In short, data governance refers to a comprehensive, predetermined set of policies to govern the entire data management lifecycle. Data governance is particularly important in highly regulated industries, where the improper use of data can result in legal action. Big Data poses particular challenges to data governance. Much of the value of Big Data comes from merging disparate data sources, creating entirely new data sources in the process. How should these new data sources be governed, and who should be allowed to analyze them? And if analysis results in “sensitive” data, which privacy safeguards need to be applied? And on and on.
- Data Stewardship – Who “owns” a particular data set or data source? Data stewards typically are responsible for applying agreed metadata definitions to data sets and ensuring the accuracy of the data for particular use cases. In traditional environments, a product manager naturally would be the data steward for a product database. But in Big Data scenarios, who owns data streaming in from sensors on products in the field?
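To make the data quality point concrete: deciding whether @JohnRSmith is really customer John R. Smith is, at bottom, a record-matching problem. Below is a minimal sketch of the kind of string-similarity check such a pipeline might run, using Python's standard library; the names, helper functions, and confidence threshold are hypothetical illustrations, not a production matching algorithm.

```python
# Hypothetical sketch: does a Twitter handle plausibly match a customer name?
# Real entity resolution would weigh many more signals (location, email,
# behavior); this only shows the basic string-matching step.
from difflib import SequenceMatcher


def normalize_handle(handle: str) -> str:
    """Strip the leading '@' and lowercase a Twitter handle for comparison."""
    return handle.lstrip("@").lower()


def name_similarity(handle: str, customer_name: str) -> float:
    """Rough similarity between a handle and a customer's full name."""
    compact_name = customer_name.replace(" ", "").replace(".", "").lower()
    return SequenceMatcher(None, normalize_handle(handle), compact_name).ratio()


score = name_similarity("@JohnRSmith", "John R. Smith")
likely_match = score > 0.8  # hypothetical confidence threshold
```

In practice a match score like this would be one input among many; the point is that "is this my customer?" is a measurable data quality question, not a guess.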
Unfortunately, data management best practices have been largely missing from the Big Data conversation. Though not entirely. Speaking at the aforementioned Strata Conference last week, IBM’s Anjul Bhambhri was one of the few speakers who touched on the issue live on theCUBE. She said:
“So when you look at all these different types of data that enterprises are trying to bring in – especially something like social data – there’s a lot of noise in that data. Not everything is relevant to a business. And to be able to filter and really get what they care about is important. And as things are going in production mode, enterprises have to be able to be confident about the quality of the data, be able to look at the lineage of the data and where it is coming from. … The same principles from the standpoint of data governance that apply to the structured world definitely apply to Big Data.”
This issue will only become more crucial to the enterprise as Big Data projects move from proofs of concept to true production deployments supporting mission-critical applications and workloads. Once Big Data projects actually start touching consumers, partners, and the outside world, data management best practices will in part determine their success or failure.