Something’s Missing from the Big Data Conversation

When talking about Big Data, the conversation tends to focus on Data Science and analytics. That is, the stories about Big Data that hit the front pages of the mainstream press and the hallway conversations taking place at events like Strata are mostly about all the cool new ways to use data to greater effect.

But Big Data Analytics doesn’t take place in a vacuum. It takes place in the enterprise. And any time you mix data and the enterprise, you can’t afford to ignore data management best practices. It may not be as sexy as predictive analytics, but failure to apply fundamental data management best practices to Big Data projects can lead not just to failed projects, but to potential legal consequences as well.

Specifically, Big Data practitioners must consider:

  • Data Quality – How accurate, complete and reliable is the data in question? In traditional data management scenarios, this meant ensuring customer names and addresses were accurate and up-to-date, for example. In Big Data scenarios, things grow more complex. Does this Twitter handle for @JohnRSmith refer to my customer John R. Smith? Do these IP addresses and mobile app log data correlate to the user or users I think they do?
  • Data Governance – What are acceptable uses of data? Who is authorized to analyze particular data sets? When should data be disposed of? In short, data governance refers to a comprehensive, predetermined set of policies to govern the entire data management lifecycle. Data governance is particularly important in highly regulated industries, where the improper use of data can result in legal action. Big Data poses particular challenges to data governance. Much of the value of Big Data comes from merging disparate data sources, which in turn creates new data sources. How should these new data sources be governed, and who should be allowed to analyze them? And if analysis results in “sensitive” data, which privacy safeguards need to be applied? And on and on.
  • Data Stewardship – Who “owns” a particular data set or data source? Data stewards typically are responsible for applying agreed metadata definitions to data sets and ensuring the accuracy of the data for particular use cases. In traditional environments, a product manager naturally would be the data steward for a product database. But in Big Data scenarios, who owns data streaming in from sensors on products in the field?
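To make the Data Quality question above a little more concrete, here is a minimal sketch of how one might score whether a social handle plausibly belongs to a known customer. Everything in it is an assumption made for illustration: the helper names `normalize` and `handle_match_score` are hypothetical, and plain string similarity from Python's standard library stands in for whatever matching logic a real project would use.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def handle_match_score(handle: str, customer_name: str) -> float:
    """Return a similarity between 0.0 and 1.0 for a handle and a customer name.

    Hypothetical helper: this compares strings only and knows nothing
    about who actually owns the account.
    """
    return SequenceMatcher(None, normalize(handle), normalize(customer_name)).ratio()


# The example from the Data Quality bullet, plus an obviously unrelated handle.
same_looking = handle_match_score("@JohnRSmith", "John R. Smith")
unrelated = handle_match_score("@JaneDoe", "John R. Smith")
```

Even a perfect score here only says the strings look alike. There may be many John R. Smiths, which is exactly why the match still needs a governed, documented verification step before anyone acts on it.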

Unfortunately, data management best practices have been largely, though not entirely, missing from the Big Data conversation. Speaking at the aforementioned Strata Conference last week, IBM’s Anjul Bhambhri was one of the few attendees who touched on the issue live on theCUBE. She said:

“So when you look at all these different types of data that enterprises are trying to bring in – especially something like social data – there’s a lot of noise in that data. Not everything is relevant to a business. And to be able to filter and really get what they care about is important. And as things are going in production mode, enterprises have to be able to be confident about the quality of the data, be able to look at the lineage of the data and where it is coming from. … The same principles from the standpoint of data governance that apply to the structured world definitely apply to Big Data.”

This issue will only become more crucial to the enterprise as Big Data projects move from PoCs to true production deployments supporting mission-critical applications and workloads. Once Big Data projects actually start touching consumers, partners, and the outside world, data management best practices will in part determine their success or failure.



  • JoeLoveATL

    There are some harsh realities in business that make the output of Big Data Analytics less effective. The challenge to overcome these obstacles requires some serious soul searching by Executive Management.

    In no particular order and of course not a complete list:

    Cart before the horse – Data Governance is an afterthought
    Source data is provided AFTER departmental parsing
    Initial qualifiers, set by managers, that exclude data
    Data order, in a traditional structured database, based on assumptions
    Executives trump data analyst’s recommendation
    Answer is sometimes predetermined and it WILL be confirmed
    The current business model is not understood
    Exposed trends are ignored or denied
    Unfortunately, many times, the truth is not what is sought

    It is not always about technology. Many leaders who take pride in not understanding the technology make limiting decisions that, by their own admission, they don’t understand. Accepting the expert advice of an underling is what differentiates a Leader from a titular Executive. Properly governed Big Data needs a position of power and truth in a company that is not subject to being swept under the rug of hidden agendas.

    I expect many readers will recognize my comments as the story of their working lives, while others may deny that any of these assertions are valid.
