Update: MapR and Google announced at Google I/O 2012 that MapR’s Hadoop distributions will be available on demand via the new Google Compute Engine, validating Wikibon’s previous analysis. Pressure remains on Hortonworks, Cloudera and other Big Data vendors to shore up their cloud strategies.
For the company that invented MapReduce, Google didn’t have much of a presence in the commercial Big Data market until just last month (with the public release of BigQuery). While Yahoo! engineers took Google’s concept and spearheaded the open source Hadoop movement, Google was happy to quietly develop its own Big Data platform for its own internal use.
And who can argue with the results? Since going public in 2004, Google’s market cap has risen from $23 billion to over $184 billion. The company dominates the search and online advertising business and is quickly becoming a major player in cloud services, mobile computing and even enterprise applications. As for Yahoo! …
Still, don’t mistake Google’s lack of first-mover status in the commercial Big Data space for an oversight or even a strategic decision to stay out of the Big Data market. Rather, Google is playing the long game and playing it well. So well, in fact, that Google is in as solid a position as anyone to capitalize on the exploding interest in Big Data for the enterprise.
Big-Data-as-a-Service Just a Start
When Google released its Big Data Platform-as-a-Service, called BigQuery, last month, the company touted its “time to insight” advantage over competing, on-premises Big Data platforms. The company boasted that customers could upload their data and be up and running Big Data Analytics via BigQuery in a matter of a day or two, versus the weeks or more needed to build custom, on-premises Hadoop clusters.
Cloud-based services such as BigQuery certainly lower barriers to enterprise Big Data adoption. They eliminate the need to buy and provision hardware, to hire and train dedicated Big Data staff to keep clusters up and running optimally, and to deploy and tune Hadoop or other Big Data software. With Big-Data-as-a-Service, enterprises can begin mining troves of data with the push of a button (or two) for new insights that could have profound effects on their businesses.
But Google isn’t the only vendor with such offerings. Amazon Web Services has its Elastic MapReduce cloud, where users can load data into S3 and begin crunching it with Amazon’s Hadoop distribution or, as of two weeks ago, either of MapR’s Hadoop distributions, M3 or M5. 1010Data has co-opted the Excel model for its “Trillion Row Spreadsheet” as-a-service offering, and Hortonworks’ HDP is now available on Microsoft Azure.
Data, Data and More Data
No, Google’s Big Data advantage isn’t simply its cloud-based model, though that is a crucial component, as we’ll see. But the search giant has another strategic asset, one that gets to the heart of Big Data … the data itself.
As I’ve written before, if you’re not mashing up data sources, you’re not doing Big Data Analytics. Specifically, Big Data Analytics becomes exponentially more valuable once users begin mashing up and enriching internal transactional data with third-party data sources. It’s at this point that truly game-changing insights begin to emerge.
Google is nothing if not a treasure-trove of data. It indexes the entire web, knows what search terms are trending, understands user behavior, and runs millions of online ad campaigns. All this data is available to Google to sell to BigQuery customers and seamlessly integrate with internal transactional data for richer analytics.
Consider a pharmaceutical company looking to sell more flu-related treatments. With Google’s data alone, clinicians are able to track and predict flu outbreaks better than the CDC. Merge this data and analysis with the pharmaceutical firm’s internal sales, R&D and customer data, and the possibilities for optimizing its supply chain, creating new products and identifying new markets grow significantly.
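To make the mashup idea concrete, here is a minimal Python sketch of joining an internal sales table with an outside flu-activity signal. It is only an illustration: the region names, weekly figures, the "flu index" itself and the function names are all invented for this example, and nothing here reflects an actual Google data product or the BigQuery API.

```python
# All names and numbers below are hypothetical, invented purely for illustration.

# Internal transactional data: (region, ISO week) -> units of flu treatment sold.
internal_sales = {
    ("Northeast", "2012-W03"): 1200,
    ("Northeast", "2012-W04"): 1850,
    ("Midwest", "2012-W03"): 900,
    ("Midwest", "2012-W04"): 950,
}

# Third-party signal: (region, ISO week) -> an assumed flu search-activity index.
flu_search_index = {
    ("Northeast", "2012-W03"): 41.0,
    ("Northeast", "2012-W04"): 67.5,
    ("Midwest", "2012-W03"): 30.2,
    ("Midwest", "2012-W04"): 31.0,
    ("South", "2012-W04"): 55.0,  # outside signal with no internal sales on record
}


def mash_up(sales, signal):
    """Inner-join the two sources on (region, week), sorted by key."""
    rows = []
    for key in sorted(sales.keys() & signal.keys()):
        region, week = key
        units, index = sales[key], signal[key]
        rows.append({
            "region": region,
            "week": week,
            "units": units,
            "flu_index": index,
            # Enriched metric: units sold per point of flu activity.
            "units_per_index_point": round(units / index, 1),
        })
    return rows


def uncovered_markets(sales, signal):
    """Region-weeks where the outside signal exists but internal sales do not."""
    return sorted(signal.keys() - sales.keys())


if __name__ == "__main__":
    for row in mash_up(internal_sales, flu_search_index):
        print(row)
    print("No sales on record:", uncovered_markets(internal_sales, flu_search_index))
```

Neither data set says much on its own. The join is what surfaces a region with flu activity and no recorded sales, which is the kind of candidate new market described above.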
All that data at Google’s fingertips, which it can easily make available to its BigQuery customers for an attractive price, is Google’s real Big Data advantage.
Now, back to the cloud.
Removing Big Data Adoption Barriers
Discussions with IT executives and other members of the Wikibon community indicate that harnessing third-party, multi-structured data for analysis is both a top area of interest and a top concern. While the benefits are significant, the hard part is grabbing all that data from the cloud and bringing it into internal Big Data platforms like Hadoop for further analysis.
Moving Big Data into corporate data centers is not a trivial undertaking. It requires licensing the data (in the case of proprietary data sets); moving potentially petabytes of data over the web; parsing, cleaning and otherwise transforming all this data; integrating it with existing data sets; and, only then, applying analytics to the data. This requires significant money, time, manpower and expertise. Yes, open source Hadoop is free to download and use, but impactful Big Data Analytics projects are not.
A cloud-based service like BigQuery removes many of these obstacles. With its data expertise and financial resources, Google is in prime position to license third-party data sets (that is, those data sets it doesn’t already own) at scale and create powerful but easy-to-consume services that make it simple for users to browse and purchase third-party data to combine with internal data sets. It can create analytics “best practices” to get new customers up and running quickly, and create a marketplace of third-party analytic applications that run on top of BigQuery.
The Long Game and Potential Risks
It will take time for Google’s Big Data strategy to play out, though. As noted on this blog and elsewhere, we are in the very early days of Big Data. While there are some high-profile exceptions, the majority of enterprises that have launched Big Data projects are still in the experimental phase. That is, enterprises typically begin their Big Data efforts with small deployments exploring their own internal data sets. Only once they become comfortable with the technology, believe they have a handle on their own data assets, and have identified a killer first use case do they proceed to the next stage in the process: incorporating third-party data.
As more and more enterprises enter this second phase of Big Data maturity, however, Google will be waiting to offer a compelling value proposition: for a monthly subscription fee, Google will offload all the complicated (and unsexy) Big Data infrastructure and data management work, mash up your internal data with valuable outside data sources, and allow you to analyze and consume it in any number of user-friendly ways.
Meanwhile, Google is doing quite all right supporting itself with its other lines of business, meaning the company has the wherewithal to wait for Big Data to cross the adoption chasm before the enterprise customers begin flowing in.
Should Google pull off this Big Data coup, it will obviously put significant pressure on both the small but growing Hadoop ecosystem (particularly Hadoop distribution vendors Hortonworks, MapR and Cloudera) and mega-vendors like IBM and EMC that are also banking on Big Data as the next IT cash cow. Their best response would be to pursue Big-Data-as-a-Service offerings themselves, most likely by partnering with cloud service providers, potentially beating Google to the punch. MapR and Hortonworks have already taken steps in this direction, as has EMC Greenplum. Who knows, Google itself may even allow some of them to offer their Big Data platforms as a service from the Google cloud. Stranger things have happened.
There are other potential stumbling blocks to Google’s Big Data plans. Specifically, many enterprises are far from comfortable allowing sensitive data outside the corporate firewall because of competitive, privacy and regulatory concerns. The case can also be made that, over the long term, public cloud deployments are actually more expensive than building internal infrastructures. And then there’s the data movement problem. Getting large volumes of data from internal data centers into the cloud and back often requires more than just an Internet connection: namely, a large external hard drive and a FedEx account.
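The data movement problem is easy to check with back-of-the-envelope arithmetic. The Python sketch below estimates how long an upload would take and compares it with shipping a drive. The 80 percent link efficiency and the two-day courier time are assumptions picked for illustration, not measured figures, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_terabytes, link_megabits_per_sec, efficiency=0.8):
    """Estimate days needed to push data over a network link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually usable (protocol overhead, contention); 0.8 is a guess.
    """
    total_bits = data_terabytes * 1e12 * 8            # decimal terabytes -> bits
    usable_bits_per_sec = link_megabits_per_sec * 1e6 * efficiency
    return total_bits / usable_bits_per_sec / SECONDS_PER_DAY


def faster_to_ship(data_terabytes, link_megabits_per_sec, courier_days=2.0):
    """True when an assumed courier delivery beats the network upload."""
    return transfer_days(data_terabytes, link_megabits_per_sec) > courier_days


if __name__ == "__main__":
    for mbps in (100, 1_000, 10_000):
        days = transfer_days(100, mbps)
        print(f"100 TB over {mbps} Mbps: about {days:.1f} days; "
              f"ship the drive instead? {faster_to_ship(100, mbps)}")
```

Under these assumptions, 100 TB over a 100 Mbps link takes roughly 116 days, and even a 1 Gbps link needs more than 11 days, which is why the hard drive and the courier stay in the picture.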
Finally, there’s the vendor lock-in question. As I wrote last month, Google’s BigQuery is not open source, unlike Apache Hadoop. Once an enterprise begins loading data into BigQuery and building applications on top of the platform, getting the data and analytics back out again and reusing the applications with Hadoop or another Big Data platform could pose a problem. There are also infrastructure integration concerns with a proprietary, cloud-based service.
Still, there will likely be a large and eager potential customer base for Google’s cloud-based Big Data platform that won’t be scared off by privacy or open source concerns, not least the tens of thousands of web developers who already use Google’s App Engine. At Google I/O this week, the company has a number of sessions scheduled to show off the latest developments and partnerships associated with BigQuery. If Google successfully applies the “consumerization of IT” concept to Big Data, making it as simple to understand and use as its consumer search service, Google may well turn out to be the Big Winner in Big Data.