Greenplum today announced a new Hadoop distribution called Pivotal HD. Based on Hadoop 2.0, it integrates the Greenplum database with Apache Hadoop.
The move addresses a common concern among Big Data practitioners. While Hadoop has proven itself a scalable, cost-effective Big Data storage and processing platform, analyzing data in Hadoop is a complex affair. It requires analysts skilled in writing MapReduce jobs in Java who also have a thorough understanding of distributed computing.
This limitation means Hadoop is not accessible to most business analysts and other skilled data professionals, who are instead versed in SQL for accessing and analyzing data stored in relational databases. To overcome this limitation, most Hadoop early adopters wishing to give business analysts access to data in Hadoop rely on connectors to move data between Hadoop and existing relational database systems.
This approach has numerous drawbacks, including adding yet more complexity to already stressed data infrastructures and increasing total time to insight.
By integrating the Greenplum database inside Hadoop, Pivotal HD reduces the need to move data between systems for analysis. Instead, analysts can query data stored in HDFS using standard ANSI SQL and expect near-real-time results, eliminating the delays associated with shuttling data between systems.
The key technological developments behind Pivotal HD, which allow for high-speed data loading and high-performance SQL queries against data stored in HDFS, are what Greenplum calls data pipelining and HAWQ.
Wikibon believes that for Big Data to live up to its promise, practitioners and vendors must move towards the development of a comprehensive Big Data platform/framework that:
- Facilitates all manner of workloads - analytic and transactional, batch and real-time;
- Stores and processes all of an organization's data - structured, unstructured and everything in between;
- And is useful to seasoned Data Scientists and regular business users alike.
Pivotal HD is a positive step in this direction and Greenplum should be applauded for putting forth a vision for the future of Big Data. With Pivotal HD, Greenplum is attempting to provide the leadership customers are craving to help them successfully navigate the transition to the Big Data Era.
That said, Pivotal HD is not a cure-all. In order to fully realize the benefits of SQL capabilities inside Hadoop, enterprises must first commit to Hadoop as the foundation of their data infrastructure. Few organizations have yet taken that step. Greenplum is clearly placing a big bet that most will. In most large enterprises, data is dispersed in numerous databases and storage platforms in numerous formats. Universal data quality and data governance standards are also sorely lacking in many organizations.
Pivotal HD also sacrifices some of the Greenplum database's functionality, due largely to limitations HDFS places on traditional SQL processing. These include the loss of data locality for data distributed throughout a cluster of nodes and support only for append-only tables. Greenplum should address these and other limitations in future versions of Pivotal HD.
Action Item: CIOs must begin formulating both (1) a vision for how best to harness Big Data in their respective enterprises and (2) practical plans to navigate the transition. From a technology infrastructure perspective, this should include frameworks for scalable, comprehensive Big Data infrastructures that support all manner of workloads. As such, Wikibon recommends CIOs seriously evaluate emerging Big Data products and services such as Pivotal HD that integrate scalable, cost-effective data storage with both batch and SQL analytics, keeping in mind that further investments in technology, people, and processes will be necessary.
Footnotes: See Wikibon's extensive catalog of free Big Data content at Wikibon.org/bigdata.