In a recent interview with InformationWeek, Microsoft CEO Steve Ballmer claimed that IBM and Oracle don’t understand Big Data. For Ballmer and Microsoft, Big Data doesn’t depend so much on the size of the data, but on the type of data being processed and analyzed.
Specifically, for a data processing and analytics project to qualify as Big Data, it must encompass not just internal corporate data, but also third-party data that resides outside the firewall, according to Ballmer. He said IBM and Oracle limit their Big Data approaches to internal data, so their offerings are not in fact Big Data by his definition.
Microsoft, Ballmer told InformationWeek, makes it easy for SQL Server R2 Parallel Data Warehouse users to integrate third-party data via the Azure DataMarket for deeper, more nuanced analysis. That, says Ballmer, is true Big Data.
The term Big Data is a moving target, of course, but there are some core characteristics. As outlined by Wikibon’s CTO and Lead Analyst David Floyer, Big Data involves petabytes or exabytes of loosely structured, distributed data in flat schemas with few complex inter-relationships. It often involves time-stamped events, such as data produced by log files, sensors and social networks.
IBM, Oracle and now Microsoft are jockeying to position each of their approaches to Big Data as the industry standard, and Ballmer is clearly trying to steer the Big Data conversation towards Microsoft’s strengths and away from its weaknesses. That means talking up Microsoft’s ability to integrate third-party data with relatively large volumes of corporate data inside Microsoft’s SQL Server R2 Parallel Data Warehouse, and playing down its lack of petabyte-scale data processing power.
Nobody disputes that integrating third-party data with internal corporate data can add valuable context to the resulting analysis. But it’s not a new concept. Companies have been integrating third-party data from Dun & Bradstreet, Dow Jones and other sources for years. Doing so allows companies to understand relationships between, say, sales and financial market activity, or how extreme weather events affect supply chains.
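To make the idea concrete, here is a minimal Python sketch of this kind of integration: internal sales figures joined to a third-party market index by date, then checked for correlation. All of the data, field names and function names are made up for illustration; a real project would pull the external feed from a data provider rather than hard-coding it.

```python
from statistics import mean

# Hypothetical internal data: daily sales totals keyed by date.
internal_sales = {
    "2011-03-01": 100.0,
    "2011-03-02": 110.0,
    "2011-03-03": 120.0,
    "2011-03-04": 130.0,
}

# Hypothetical third-party data: a market index closing value keyed by date.
market_index = {
    "2011-03-01": 1000.0,
    "2011-03-02": 1010.0,
    "2011-03-03": 1020.0,
    "2011-03-05": 1050.0,
}


def join_on_date(internal, external):
    """Inner join: keep only the dates present in both sources."""
    shared_dates = sorted(internal.keys() & external.keys())
    return [(d, internal[d], external[d]) for d in shared_dates]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


joined = join_on_date(internal_sales, market_index)
r = pearson([sales for _, sales, _ in joined],
            [index for _, _, index in joined])
print(f"{len(joined)} shared days, correlation {r:.2f}")
```

Note that nothing here depends on data volume. The join works the same way on four rows or four billion, which is why third-party integration says little about whether a project is Big Data.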
Cloud-based data marketplaces like DataMarket make it significantly easier for non-expert business users to explore and choose third-party data sources on their own, but the idea of integrating third-party data with internal data is neither original to Ballmer nor a prerequisite for Big Data projects.
The ultimate expression of Big Data processing is, of course, Apache Hadoop. The open source framework allows for parallel processing of huge volumes of unstructured data distributed over many commodity machines. It is batch-oriented, not real-time, and requires significant expertise to deploy and tune.
Hadoop can be and is often used to process and analyze data that resides outside of corporate firewalls, but not always. A multi-national organization might use Hadoop to analyze data collected from thousands of sensors scattered around the globe in its many factories and warehouses. The data is internal to the company, yet nobody would claim this is not a Big Data job.
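The sensor scenario maps naturally onto the map/shuffle/reduce pattern that Hadoop popularized. The sketch below runs that pattern in a single Python process purely to show the shape of such a job. It does not use any Hadoop API, and the log format, sensor names and function names are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sensor log lines: "timestamp,sensor_id,temperature_c".
LOG_LINES = [
    "2011-03-01T00:00:00,factory-a/s1,20.5",
    "2011-03-01T00:00:00,warehouse-b/s7,4.0",
    "2011-03-01T00:05:00,factory-a/s1,22.0",
    "2011-03-01T00:05:00,warehouse-b/s7,3.5",
    "malformed line",
]


def map_phase(line):
    """Emit (sensor_id, temperature) pairs, skipping unparseable lines."""
    parts = line.split(",")
    if len(parts) != 3:
        return
    _, sensor_id, temperature = parts
    try:
        yield sensor_id, float(temperature)
    except ValueError:
        return


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(sensor_id, temperatures):
    """Collapse all readings for one sensor into its maximum."""
    return sensor_id, max(temperatures)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


result = run_job(LOG_LINES)
print(result)
```

In a real cluster the map and reduce functions would run in parallel across many machines, each working on its own slice of the logs. Every record in this job is internal to the company, yet the workload is Big Data all the same once the volume is large enough.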
So why would Ballmer make such a claim? Perhaps because Microsoft’s answer to Big Data, the recently released SQL Server R2 Parallel Data Warehouse appliance, doesn’t meet a key requirement: the Big in Big Data. The largest R2 Parallel Data Warehouse deployment, as far as I can tell, is the 30+ terabyte deployment by the Direct Edge stock exchange, as reported by Doug Henschen at InformationWeek:
The move from ECN to public stock exchange has brought rapid growth [for Direct Edge], with trading now generating about 2 terabytes of new data per month. Direct Edge has a conventional Microsoft SQL Server 2008 data warehouse built on clustered, high-end servers, “but we realized that we needed a platform that would scale to hundreds of terabytes,” said Direct Edge chief technology officer Richard Hochron in an exclusive interview with InformationWeek.
Instead of scaling up on ever larger and more expensive proprietary servers, the switch to PDW will enable Direct Edge to scale out on commodity x86 Intel servers using massively parallel processing–an approach now common to most data warehousing appliances.
30 terabytes of data is nothing to sneeze at, nor is scaling to “hundreds of terabytes,” as Direct Edge plans. But it’s not Big Data. That’s not to say Microsoft customers like Direct Edge won’t benefit from the R2 Parallel Data Warehouse appliance. The fact is, however, that Microsoft cannot, as of now, support truly innovative, cost-effective, petabyte-scale Big Data deployments.
(Interestingly, the Direct Edge deployment doesn’t include integrating third-party data either, so it fails the Big Data test even by Ballmer’s definition.)
The good news for Microsoft is that Big Data approaches like Hadoop are complementary to MPP-style data warehousing. Results produced by Hadoop clusters can be loaded into MPP data warehouses like R2 Parallel Data Warehouse for further analysis, for example.
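As a rough sketch of that hand-off, the Python below loads tab-separated job output (the key, a tab, then the value, which is the layout Hadoop's default text output uses) into a relational table and joins it against existing reference data. SQLite stands in for the MPP warehouse here only so the example is self-contained; the table names, column names and sample values are all hypothetical.

```python
import sqlite3

# Hypothetical Hadoop job output: one "key<TAB>value" record per line.
HADOOP_OUTPUT = "factory-a/s1\t22.0\nwarehouse-b/s7\t4.0\n"


def load_job_output(conn, text):
    """Parse tab-separated records and insert them into sensor_max."""
    rows = []
    for line in text.splitlines():
        if not line:
            continue
        sensor_id, max_temp = line.split("\t")
        rows.append((sensor_id, float(max_temp)))
    conn.executemany(
        "INSERT INTO sensor_max (sensor_id, max_temp_c) VALUES (?, ?)", rows
    )


# SQLite is a stand-in for the warehouse; the schema is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_max (sensor_id TEXT PRIMARY KEY, max_temp_c REAL)"
)
conn.execute(
    "CREATE TABLE sensor_site (sensor_id TEXT PRIMARY KEY, site TEXT)"
)
conn.executemany(
    "INSERT INTO sensor_site VALUES (?, ?)",
    [("factory-a/s1", "Factory A"), ("warehouse-b/s7", "Warehouse B")],
)

load_job_output(conn, HADOOP_OUTPUT)

# Further analysis in SQL: which sites recorded a maximum above 10 C?
hot_sites = conn.execute(
    """
    SELECT s.site, m.max_temp_c
    FROM sensor_max AS m
    JOIN sensor_site AS s ON s.sensor_id = m.sensor_id
    WHERE m.max_temp_c > 10
    ORDER BY s.site
    """
).fetchall()
print(hot_sites)
```

The division of labor is the point: the cluster reduces raw, loosely structured data to a small result set, and the warehouse does what it is already good at, which is joining that result to structured corporate data.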
The truth is, none of the large IT vendors (not Microsoft, not Oracle, not IBM) has figured out how best to leverage Big Data to its benefit. Some, EMC for example, are further along than others, while smaller vendors, most notably Cloudera, are furthest along in delivering enterprise-ready Hadoop distributions.
But the Big Data vendor landscape is still very much an emerging ecosystem. For companies looking to experiment with Big Data, the best option today is to work with Cloudera’s open Hadoop distribution. It’s the most mature Hadoop distro out there, and benefits from the input of a large and active open source contributor community.
Companies that want to keep their Big Data options open for the future, as I suggest all organizations do, should ask their incumbent data warehouse vendors how they plan to integrate Hadoop and other emerging Big Data technologies with their existing data warehouse products. If you’re in the market for new data warehouse technology, make sure Hadoop connectivity is on your list of prerequisites, even if you don’t have any immediate Big Data plans.
Certainly integrating third-party data with internal corporate data is an important part of many large-scale data analytics projects, but it’s not a prerequisite for Big Data analysis, whatever Ballmer and Microsoft tell you.
(Below is a good Hadoop 101 from Cloudera’s Mike Olson via SiliconANGLE.tv.)