Apache Cassandra, the open source distributed database, is attracting significant attention from Web application developers and data management pros, both at innovative start-ups and traditional enterprises. Last week 800 Cassandra users and developers got together at the third annual Cassandra Summit in Santa Clara to discuss the latest trends and use cases for Cassandra, as well as to network and learn from one another’s experiences with the open source database.
Among the presenters were representatives from several big name enterprises that have deployed Cassandra to support mission critical real-time Web applications in production. They included the Walt Disney Co., which uses Cassandra as a central Data-as-a-Service hub supporting its various business units; Netflix, which taps the open source database to serve up streaming movies and TV shows to its customers; and eBay, which supports a number of production applications with Cassandra, including its “eBay Social Signal” initiative.
A number of Cassandra users also told their stories on theCUBE, including HealthCare Anytime, Hobsons and SourceNinja (as did the aforementioned Netflix).
What came through in these discussions with end-users is that Cassandra is a mature, scalable, and reliable database that has come a long way over the last two years. While administering and managing Cassandra is still more complex than some competing NoSQL databases, Cassandra has won over many converts due to a number of technological advantages it has over rivals. Two of the most important are Cassandra’s:
- Ability to access and deliver data in near real-time. First and most importantly, Cassandra has proven itself capable of delivering near real-time performance to support interactive, Web-based applications at scale. It does this through a combination of its ability to store and access data in columns, its ability to perform extremely fast inserts, its use of distributed counters, and its ability to take advantage of solid-state drives.
- Ability to deploy across data centers. Cassandra can be deployed across multiple, geographically dispersed data centers to provide high-level redundancy, failover, and back-up & recovery capabilities. This includes highly granular controls over data replication across data centers to optimize for both performance and stability.
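To make the second point concrete, here is a minimal sketch of how per-data-center replication is typically expressed. It assumes the CQL `CREATE KEYSPACE` syntax with `NetworkTopologyStrategy` found in more recent Cassandra releases; the keyspace name `events`, the data center names `us_west` and `eu_central`, and both helper functions are hypothetical and chosen only for illustration. The script only builds the statement text and does not connect to a cluster.

```python
def keyspace_cql(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement with per-data-center replica counts."""
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")
    # Sort the data centers so the generated statement is deterministic.
    options = ", ".join(
        f"'{dc}': {count}" for dc, count in sorted(replicas_per_dc.items())
    )
    return (
        f"CREATE KEYSPACE {name} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


def local_quorum(replicas_in_dc):
    """Replicas that must acknowledge a LOCAL_QUORUM request in one data center."""
    return replicas_in_dc // 2 + 1


# Hypothetical layout: three replicas in one data center, two in another.
layout = {"us_west": 3, "eu_central": 2}
print(keyspace_cql("events", layout))
for dc, count in sorted(layout.items()):
    print(dc, "local quorum:", local_quorum(count))
```

The trade-off between performance and stability shows up in the replica counts: more replicas in a data center tolerate more node failures there, but each quorum read or write must then wait on more acknowledgements.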
In addition to the technology itself, another positive Apache Cassandra has going for it is the open source community. The Summit demonstrated that the Cassandra community is highly sophisticated, extremely knowledgeable, and, above all, serious about deploying real-time Big Data applications today. We came across very few attendees with pie-in-the-sky ideas about what they might achieve. Rather, most had already deployed mission-critical applications with Cassandra and were investigating ways to increase the performance and manageability of the database to even higher levels. Cassandra Summit was not an event for NoSQL newbies.
Then there’s DataStax, the commercial Cassandra company and host of the Cassandra Summit. The company has contributed a number of its own value-add capabilities to the database, which DataStax makes available through its enterprise-level Big Data platform. These include advanced workload isolation/management, improved deployment and administration capabilities via DataStax OpsCenter, and the bundling of Cassandra with Apache Hadoop and Apache Solr. The latter development is increasingly important, as it allows users to deploy all three technologies on a single cluster, reducing CapEx and simplifying management & administrative duties.
Between the open source community and the team of engineers at DataStax – two groups that overlap, including DataStax Co-Founder and CTO Jonathan Ellis, who is also chair of the Apache Cassandra project – Cassandra continues to improve in both performance and scalability. More updates are due this October, when concurrent schema change and virtual node capabilities are expected to debut.
The NoSQL Competition
This is not to say that Cassandra and DataStax don’t have competition in the real-time Big Data market. HBase continues to enjoy strong adoption among developers, though its reliance on the Hadoop Distributed File System and its management complexity are potential obstacles to supporting real-time, interactive (non-analytics) applications. MongoDB is well regarded for its ease of use, but has shown it tops out performance-wise at relatively small data volumes.
Then there’s Apache Accumulo, an emerging open-source NoSQL database that proponents maintain can support both real-time applications and Big Data analytics. Sqrrl, a start-up that recently popped out of the National Security Agency (where the database got its start), is in the process of emerging from stealth to commercialize Accumulo. It’s still early days, but keep an eye out for this one.
On the commercial front, DataStax competes with the Hadoop distribution vendors. Cloudera, which got off to an early start in the Big Data market in 2009, has built an impressive business in a relatively short period of time aimed mostly at Big Data analytics use cases. There are relatively few examples of Cloudera supporting real-time, interactive applications, however.
Hortonworks, which spun out of Yahoo last year, is a more intriguing competitor due to its business model – the company charges for high-level technical support and training but makes its HDP platform free to download – and its leading role in developing YARN. Also known as NextGen MapReduce, YARN aims to allow Hadoop clusters to process data using frameworks other than MapReduce. If successfully developed, YARN would make Hadoop adept at supporting real-time Big Data applications, graph analytics, enterprise-search-style functionality, and other data processing styles in addition to batch-oriented, large-scale Big Data analytics.
For now, though, Cassandra stands at the front of the NoSQL pack when it comes to supporting real-time, interactive (non-analytics) Big Data applications.
Action Item: Enterprises and start-ups developing Big Data applications – that is, applications that deal with large volumes of multi-structured data and require real-time data interactivity – should strongly consider using Apache Cassandra at the database layer. Cassandra has proven itself in numerous mission-critical production scenarios and continues to improve in performance, scalability, and management thanks to contributions from the community and commercial sponsors like DataStax. Further, enterprises that are also interested in expanding their Big Data capabilities to include batch analytics may be well served by DataStax’s three-pronged Big Data approach – bundling Cassandra, Hadoop, and Solr in one platform.