The founders of Qubole, a Big Data start-up based in Mountain View, Calif., have built a Hadoop managed service whose ultimate goal is to empower data architects and data scientists by providing turnkey Hadoop infrastructure from the cloud. Co-founder Ashish Thusoo, who created Hive while part of the data infrastructure team at Facebook, says he wants to help companies up-level the Big Data conversation from talk of clusters and nodes to analytics and business insights.
Qubole sees two major bottlenecks to getting Hadoop deployments off the ground. The first is the upfront cost of hardware and expert staff. The second is the complexity of deploying and maintaining Hadoop clusters and related software. Compounding both is a shortage of talented Big Data infrastructure and hardware professionals.
Qubole uses the public cloud, specifically AWS, to abstract away the complexities of Hadoop infrastructure operations. Its service, called Qubole Data Service (QDS), starts with more than a dozen data connectors for importing and exporting data to and from the platform, including connectors for MongoDB, Google Analytics, HP Vertica and Oracle. The core of the service is Qubole’s managed Hadoop platform, which includes support for Hive, Sqoop, Pig and Oozie, as well as an SDK for building applications in Python. Data is stored in S3, Hadoop clusters include auto-scaling capabilities, and MapReduce jobs and SQL-style queries can be created via an interactive graphical user interface.
As a managed cloud service, Qubole handles all operations, meaning users don’t need to provision hardware, configure clusters or continually tune clusters to maintain performance. Clusters are spun up only when jobs are kicked off and are automatically expanded or contracted based on the characteristics of the workload. QDS also leverages AWS’s spot market, allowing users to bid on spot instances by setting the maximum price they are willing to pay.
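To make the scaling and spot-bidding behavior described above concrete, the following is a minimal sketch in Python. It is not Qubole's algorithm: the `ScalingPolicy` fields, the tasks-per-node sizing rule and the fallback to on-demand capacity are all assumptions chosen for illustration, and none of the names correspond to a QDS or AWS API.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # Hypothetical knobs for illustration only; not actual QDS settings.
    min_nodes: int = 2          # smallest cluster started once work arrives
    max_nodes: int = 50         # upper bound on cluster size
    tasks_per_node: int = 8     # assumed number of tasks one node can run
    max_spot_bid: float = 0.10  # highest price per node-hour the user will pay


def target_nodes(pending_tasks: int, policy: ScalingPolicy) -> int:
    """Return the cluster size suggested by the current backlog of tasks."""
    if pending_tasks <= 0:
        return 0  # no jobs queued, so no cluster is running
    needed = math.ceil(pending_tasks / policy.tasks_per_node)
    # Clamp the size between the configured minimum and maximum.
    return max(policy.min_nodes, min(policy.max_nodes, needed))


def choose_market(spot_price: float, policy: ScalingPolicy) -> str:
    """Pick spot capacity only while the market price is within the bid."""
    # Falling back to on-demand instances is an assumption of this sketch.
    return "spot" if spot_price <= policy.max_spot_bid else "on-demand"


if __name__ == "__main__":
    policy = ScalingPolicy()
    for backlog in (0, 10, 200, 1000):
        print(backlog, "tasks ->", target_nodes(backlog, policy), "nodes")
    print(choose_market(0.05, policy), choose_market(0.25, policy))
```

In this toy model an empty queue means no cluster at all, a growing backlog expands the cluster up to a ceiling, and spot instances are used only while their price stays at or below the user's bid.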
As for performance, Thusoo and his team have further optimized Hive, the open source data warehouse framework for querying Hadoop data, and have built IO optimizers for S3. The most recent addition to QDS is support for Presto, another open source project created by Facebook, which performs interactive SQL-style analytics against petabytes of data. Facebook claims Presto is orders of magnitude more efficient than Hive and can return query results in milliseconds.
Qubole currently has more than 50 active accounts, though not all are paying customers. According to Thusoo, Qubole’s largest customers are supported by 1,000+ node clusters, but most use clusters of between 20 and 50 nodes. In all, Qubole users process more than 12 petabytes of data per month.
Most users are from online-native companies, including digital marketing firms, SaaS vendors and online retailers. Users are generally (1) data architects who use QDS to integrate, transform and otherwise prep large data volumes for use by business analysts via data visualization tools like Tableau Software and (2) data scientists who perform large-scale data explorations and analysis within the platform itself. QDS is not designed to support interactive Big Data applications deployed to large numbers of concurrent business users.
The company is taking on not just Hadoop distribution vendors, whose software is almost always deployed on-premise, but also other cloud-based Hadoop services, such as AWS Elastic MapReduce, that still require users to configure software and that lack GUIs, high-level APIs and other features to simplify analytics and application development.
Qubole, founded in 2012, currently has 35 full-time employees and raised $7 million in a Series A round in April 2013 led by Charles River Ventures and Lightspeed Venture Partners. Its service is also available on the Google Cloud Platform.
The benefit of a Hadoop managed service like QDS is that it removes the need for upfront capital expenditure on hardware and the need to hire expensive and scarce Big Data practitioners. Users can focus attention on distilling actionable insights from data rather than on maintaining the infrastructure to support such analysis. But it requires users to move critical data to the cloud, where security and SLA standards do not always match expectations.
Action Item: Qubole’s challenge is to convince non-digital-native companies in industries such as healthcare, financial services and manufacturing to trust their data to a small start-up operating in the public cloud. To the extent that Qubole can do so, it will increasingly be tasked with supporting customers with more complex hybrid on-premise/cloud environments. The value proposition for Qubole is clear, however, and Wikibon believes the public cloud is indeed an attractive environment for data scientists and analysts looking to perform large-scale exploratory analytics.