A new open source project is attempting to reverse engineer Dremel, Google’s internal system for performing low latency, ad hoc, SQL-like queries on large volumes of multi-structured data.
Dremel is the technology behind BigQuery, Google’s Big-Data-Analytics-as-a-Service offering that became generally available in May. The open source project, Apache Drill, is meant to complement Hadoop and MapReduce, which were originally developed to batch process, store, and perform deep historical analysis on Big Data (and were also inspired by a Google project). According to project committers, Drill is being developed to deploy on existing Hadoop clusters so that users don’t need to spin up new environments and move large volumes of data between the two. Drill can access and analyze data where it resides, such as in HDFS, the popular Hadoop file store.
To understand the differences between the two technologies – Hadoop/MapReduce and Drill – Wikibon spoke with Apache Drill Committer and MapR Director of Project Management Tomer Shiran. He used the example of an online marketplace, such as the Android marketplace, to illustrate the distinct use cases. Whereas an expert data scientist might use Hadoop to analyze years of marketplace usage data to find hidden customer behavior patterns, a less sophisticated analyst could tap Drill’s SQL-like functionality to answer specific questions, such as, “What were the top 100 apps during the last quarter?” or “What time of day is most popular for app downloads?”
Further, the Hadoop/MapReduce job could take minutes or hours to return results, whereas Drill produces query results in near real-time – seconds or less. And where Hadoop requires significant expertise from the user, Drill is aimed at regular business intelligence users, Shiran said.
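To make the two example questions concrete, here is a minimal sketch of how they might look as SQL. It uses Python's built-in sqlite3 module as a stand-in, not Drill itself, and makes no claim about Drill's actual query syntax. The table name app_downloads, its columns, and the sample rows are all hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical download log: (app name, download timestamp). Illustrative only.
DOWNLOADS = [
    ("maps", "2012-04-03 09:15:00"),
    ("maps", "2012-05-11 20:40:00"),
    ("chess", "2012-05-12 20:05:00"),
    ("maps", "2012-06-01 20:30:00"),
    ("chess", "2012-06-20 08:45:00"),
    ("notes", "2012-06-21 20:10:00"),
    ("notes", "2012-07-02 10:00:00"),  # falls outside the quarter queried below
]


def build_db():
    """Create an in-memory database holding the hypothetical download log."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE app_downloads (app TEXT, downloaded_at TEXT)")
    conn.executemany("INSERT INTO app_downloads VALUES (?, ?)", DOWNLOADS)
    return conn


def top_apps(conn, start, end, limit=100):
    """'What were the top N apps during the last quarter?'"""
    return conn.execute(
        """
        SELECT app, COUNT(*) AS downloads
        FROM app_downloads
        WHERE downloaded_at >= ? AND downloaded_at < ?
        GROUP BY app
        ORDER BY downloads DESC, app ASC
        LIMIT ?
        """,
        (start, end, limit),
    ).fetchall()


def busiest_hour(conn):
    """'What time of day is most popular for app downloads?' (hour, 0-23)"""
    row = conn.execute(
        """
        SELECT CAST(strftime('%H', downloaded_at) AS INTEGER) AS hour,
               COUNT(*) AS downloads
        FROM app_downloads
        GROUP BY hour
        ORDER BY downloads DESC, hour ASC
        LIMIT 1
        """
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = build_db()
    print(top_apps(conn, "2012-04-01 00:00:00", "2012-07-01 00:00:00"))
    print(busiest_hour(conn))
```

Both questions reduce to a short aggregate query that an analyst can write and rerun interactively, which is the style of work Shiran contrasts with writing a MapReduce job.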
Apache Drill is currently in incubator status, with a founding team of eight committers. These include Shiran, as well as MapR Co-Founder and CTO M.C. Srivas; Ryan Rawson, Vice President of Engineering at Big Data startup Drawn to Scale and an Apache HBase committer; and Chris Wensel, CEO of Concurrent and developer of the Cascading application framework.
With two members of the MapR team making up a quarter of the project, Drill clearly enjoys significant support from that company. MapR currently sells a customized, high-performance Hadoop distribution that includes Direct Access NFS. The company has chosen to keep some of its Hadoop source code closed, resulting in criticism from some members of the open source Big Data community. However, MapR is committed to maintaining Drill’s open source status, according to MapR Vice President of Marketing Jack Norris.
MapR: To Open Source or Not to Open Source
In a recent conversation, Norris and Shiran explained MapR’s philosophy in regard to open source. In those open source scenarios where the development of a technology has progressed to the point that making significant, enterprise-focused improvements to its foundation within the open source community would be too difficult or time-consuming, MapR prefers to bring development in house, as it did with its Hadoop distribution, M5. In such cases, MapR believes providing open APIs to the community is more important than open sourcing the code itself, Norris said. But when the company can get involved with a project early and provide input on the technology’s development from the ground up, such as with the nascent Apache Drill project, MapR is happy to work in the confines of the open source community, they told Wikibon.
MapR also enjoys a close relationship with Google. MapR is the only Hadoop vendor authorized by the search giant to offer its distribution via Google’s public cloud service, Google Compute Engine. That deal was announced at Google I/O earlier this summer.
It is very early days for Drill, however. Working within the confines of the Apache project, it could be months or years before Drill reaches 1.0 status. Still, Norris envisions a day when MapR includes support for Drill in its Hadoop distribution as it does now for Pig and Hive.
Speaking of Hive, Drill is not the first attempt at developing SQL-like, real-time query capabilities for Big Data. Hive is a data warehousing framework for Hadoop that was originally developed by Facebook for just such a purpose. While Hive has enjoyed limited success, MapR’s Shiran maintains that Hive is still too high-latency for real-time queries. There’s also Hadapt, a Boston-based startup that is developing its own Hadoop platform that includes native SQL capabilities. And MPP analytic database vendors such as HP Vertica, EMC Greenplum, and others have long positioned their databases as providing Hadoop environments with complementary real-time query capabilities.
Action Item: Drill is yet another example of the remarkable innovation occurring in the open source Big Data community. Still, while promising, Drill is very much a work in progress, even compared to Hadoop. Organizations with immediate or short-term real-time Big Data query needs should consider alternative approaches such as MPP analytic databases or Google’s BigQuery service, while keeping a close eye on Drill as it progresses.