The definition of Big Data contains a nuance that sometimes gets overlooked, and overlooking it obscures one of the most important benefits of technologies like Hadoop.
First, the oft-missed nuance.
The term Big Data describes data sets whose size and type make them impractical to process and analyze with traditional database technologies and related tools.
Note the word impractical. It doesn't say impossible. Why does this distinction matter? Because it implies two things: traditional tools can process Big Data, but only in ways that make the resulting analysis not worth the effort; and new approaches, namely Hadoop and Next Generation Data Warehousing, overcome those limitations and make Big Data processing and analysis a practical reality for organizations with the requisite infrastructure and expertise.
One of the two limitations of traditional tools is cost. Traditional data warehouses employ a scale-up approach: to accommodate growing data volumes, you must buy a bigger, more powerful box. That gets quite expensive when you are dealing with large data volumes on proprietary machines. Big Data technologies take a scale-out approach instead. As data volumes increase, you simply add more servers and distribute the job across multiple commodity nodes.
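To make the scale-out idea concrete, here is a minimal, hypothetical Java sketch (not Hadoop itself) that splits a workload across a configurable number of worker threads standing in for commodity nodes. Growing the "cluster" just means raising NODES, rather than buying a bigger machine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy illustration of scale-out: split work across N "nodes" and merge partial results. */
public class ScaleOutSketch {

    static final int NODES = 4; // scaling out = raising this number, not buying a bigger box

    public static void main(String[] args) throws Exception {
        // Stand-in for a large dataset of call records; in reality this would live on many disks.
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            records.add(i % 50 == 0 ? "DROPPED" : "OK");
        }

        ExecutorService cluster = Executors.newFixedThreadPool(NODES);
        List<Future<Long>> partials = new ArrayList<>();

        // Each "node" scans only its own slice of the data.
        int sliceSize = (records.size() + NODES - 1) / NODES;
        for (int n = 0; n < NODES; n++) {
            int start = n * sliceSize;
            int end = Math.min(start + sliceSize, records.size());
            List<String> slice = records.subList(start, end);
            partials.add(cluster.submit((Callable<Long>) () ->
                    slice.stream().filter("DROPPED"::equals).count()));
        }

        // Merge the partial counts, the same role a reduce step plays in Hadoop.
        long dropped = 0;
        for (Future<Long> partial : partials) {
            dropped += partial.get();
        }
        cluster.shutdown();

        System.out.println("Dropped calls: " + dropped);
    }
}
```

A real Hadoop cluster does the same thing across machines rather than threads, with HDFS holding each node's slice of the data locally so the computation moves to the data instead of the other way around.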
The other limitation is speed. If money were no object, traditional data warehouses could theoretically scale up to accommodate extremely large data volumes. But they would still process and analyze that data prohibitively slowly.
This brings me to the sometimes-obscured benefit that truly distinguishes Big Data technologies from traditional tools.
It's not just the ability to handle petabytes of data cost effectively that makes Big Data technologies so powerful. What is truly game changing is their ability to process large data volumes fast enough that users can take an iterative approach to analysis and problem solving.
Take the case of a telecom provider that wants to determine the cause of a spike in dropped calls on its network. A properly configured Hadoop cluster could complete that type of analysis in a matter of hours, at which point the provider could follow up with subsequent queries that narrow down the results until the root cause of the dropped calls is found.
The same analysis, which involves large volumes of usage data, might take weeks or more with a traditional data warehouse. By that time, any chance of resolving the issue before it causes significant harm has long since passed and there’s certainly no point in performing follow-up queries. So the analysis could be accomplished with traditional tools, but it would take so long as to render the results useless.
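To ground the example, here is a sketch of what a first-pass Hadoop MapReduce job for that telecom scenario might look like. The input layout (a comma-separated call detail record with a tower ID and a status field) and the field positions are hypothetical; the point is that each follow-up question is just another short job like this one, run over the same data.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** First-pass question: which cell towers account for the most dropped calls? */
public class DroppedCallsByTower {

    /** Emits (towerId, 1) for every dropped-call record in the hypothetical CDR format. */
    public static class DropMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text towerId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical record layout: callId,towerId,startTime,durationSec,status
            String[] fields = value.toString().split(",");
            if (fields.length >= 5 && "DROPPED".equals(fields[4].trim())) {
                towerId.set(fields[1].trim());
                context.write(towerId, ONE);
            }
        }
    }

    /** Sums the dropped-call count per tower. */
    public static class DropReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "dropped calls by tower");
        job.setJarByClass(DroppedCallsByTower.class);
        job.setMapperClass(DropMapper.class);
        job.setCombinerClass(DropReducer.class); // safe here: the reduce is a simple sum
        job.setReducerClass(DropReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the CDR directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Once a job like this surfaces the worst towers, the follow-up questions (are the drops concentrated in certain hours? after a particular firmware update?) are each just another small job or Hive query over the same data, which is exactly the iterative loop a slow warehouse makes impractical.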
Remember this benefit the next time you're educating your CEO about Big Data. Hadoop doesn't just let you find answers to business questions buried deep within large volumes of data on less expensive hardware. It does so fast enough that you can follow up with more queries and find the answer you need in time to act on it.
Put another way, Hadoop lets you ask more of your data, more often.