In case you were unsure, I can confirm that the era of ‘Big Data’ is here to stay. Generating over 140 million Tweets a day, Twitter alone could keep the ‘Big Data’ moniker relevant.
Companies of all types are eager to get their hands on data generated from Twitter, Facebook, blogs and other social media to better understand their customers. But they need help. There’s just too much data. That’s where Hadoop comes in.
In the pre-Twitter days, customer analytics basically consisted of loading some CRM and sales data into a data warehouse, slapping a business intelligence tool like Crystal Reports on top, and pumping out charts and graphs covering customer demographics and sales patterns.
Those days are gone. For one, data warehouses just weren’t designed to handle the huge volume of data social media currently generates. Nor are most data warehouses able to ingest or analyze unstructured data like text-based Tweets.
Hadoop might be the answer to both shortcomings.
“Hadoop is in many ways the closest thing we have right now towards an open standards framework for building out a content warehousing environment in which we can do essentially in-database analytics, text mining, and sentiment analysis on large volumes of both structured and unstructured data,” according to Forrester Research’s Jim Kobielus.
Developed largely by Yahoo! as part of the open source Apache project, Hadoop is a Java-based framework that can distribute huge volumes of unstructured data across multiple nodes for relatively fast analytics. Hadoop allows you to bring the code to the data rather than the other way around. It can process multiple petabytes of data. That’s a lot of Tweets.
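To make the “bring the code to the data” idea concrete, here is a toy, single-process sketch of the MapReduce pattern that underlies Hadoop. Hadoop itself is a Java framework, and nothing below is the actual Hadoop API — the function names and sample tweets are purely illustrative. The point is the shape of the computation: a map step runs independently against each node’s slice of the data, and a reduce step merges the small per-node results, so the bulky raw data never has to move.

```python
# Conceptual sketch of MapReduce, NOT the Hadoop API.
# Each chunk stands in for the data held on one node.
from collections import Counter
from functools import reduce

def map_phase(tweet_chunk):
    """Runs where the data lives: count words in one node's tweets."""
    counts = Counter()
    for tweet in tweet_chunk:
        counts.update(tweet.lower().split())
    return counts

def reduce_phase(partials):
    """Merge the small per-node counts into one global result."""
    return reduce(lambda a, b: a + b, partials, Counter())

# Pretend each inner list sits on a separate cluster node.
chunks = [
    ["big data is here", "hadoop handles big data"],
    ["tweets are unstructured data"],
]
totals = reduce_phase(map_phase(c) for c in chunks)
print(totals["data"])  # prints 3: two mentions on node 1, one on node 2
```

In a real Hadoop cluster the same division of labor holds, except the map tasks are scheduled onto the machines that already store each block of data, which is what makes petabyte-scale analysis practical.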
There have been some instances of data warehouses reaching the petabyte level thanks to advances in parallel processing (think eBay’s two massive data warehouses, one each run on Teradata’s and Greenplum’s platforms). But not every enterprise has eBay’s resources or internal expertise for such a massive undertaking. And there’s still that problem of data warehouses not being able to process unstructured data.
So in addition to adeptly handling unstructured text, Hadoop has another benefit that could take part of the burden off internal IT departments: it is perfect for deployment in the cloud, Kobielus said. That means companies will be able to analyze petabytes of data, but they won’t have to integrate, cleanse and store it themselves.
The problem, as some see it, is that none of the IT mega-vendors – IBM, Oracle and Microsoft – have brought a commercialized version of Hadoop to the market. Of course, that’s the way some in the open source community like it, but companies that lack sophisticated data analysts and statisticians could probably benefit from a commercially supported Hadoop framework that’s relatively easy to provision.
Small innovators like Cloudera and Datameer are taking the lead in bringing commercial Hadoop to the market, though, and IBM and Teradata are currently working on incorporating Hadoop into their existing ‘Big Data’ product lines.
So it may not be quite ready for mainstream adoption, but expect to hear a lot more about Hadoop in the coming months and years.
To get a better handle on Hadoop, check out this primer with Cloudera’s Mike Olson, SiliconAngle’s John Furrier and Wikibon’s Dave Vellante: