Over the last ten years, LexisNexis Risk Solutions has developed what it calls a High Performance Computing Cluster, a proprietary method for processing and analyzing large volumes of data for its clients in finance, utilities and government. The company this week made HPCC open source and spun off HPCC Systems to develop and market the technology.
LexisNexis is positioning HPCC as a competitor to Apache Hadoop, the open source software framework for Big Data processing and analytics. The entry of LexisNexis and HPCC into the Big Data ecosystem is yet another validation of the Big Data space and should spur innovation from all parties – HPCC, Hadoop and others.
Whether HPCC is a viable competitor to Hadoop for Big Data dominance is another question. LexisNexis, which has vast experience in collecting and processing large volumes of media and industry data, certainly thinks it is. The answer, of course, depends on a number of factors, most of which are not yet clear. Here is my initial analysis:
- Maturity – Ten years in the making, HPCC has a three-year head start over Hadoop, which was developed, more or less, in 2004. Since then, however, Hadoop has benefited from the contributions of thousands of developers via the Apache Software Foundation. HPCC was developed behind closed doors by an undetermined number of LexisNexis developers. If you subscribe to the notion that two heads are better than one, Hadoop is likely the more mature technology thanks to its open source heritage, despite being a few years younger than HPCC.
- Programming language – While Hadoop is, overall, the more mature of the two Big Data technologies, HPCC may have the edge in some specific functional areas. LexisNexis claims HPCC’s programming language, Enterprise Control Language, enables “data analysts and developers to define ‘what’ they want to do with their data instead of giving the system step-by-step instructions.” Though it doesn’t call out Pig Latin, the most popular Hadoop programming language, specifically, LexisNexis is implying that ECL is the easier and faster of the two languages in which to create Big Data processing jobs. If true, this could be an advantage for HPCC over Hadoop, whose main drawback is that it requires significant expertise to use.
- Real-time data analytics – HPCC’s Rapid Data Delivery Engine, a.k.a. Roxie, allows users to run real-time queries against HPCC. This appears to be an advantage over Hadoop, which is generally batch-oriented and used for “rear-view mirror” analysis. On closer inspection, however, although Roxie is able to return query results in under a second, the data it hits has already passed through Thor, HPCC’s data processing cluster. So the queries run in near real-time, but the data is not updated in real-time. In other words, HPCC is not capable of real-time data analytics, as far as I can tell, so on this point it’s a wash. Facebook is working on some interesting real-time analytics jobs using Hadoop, however, which, if applicable to other use cases, could be a significant improvement for Hadoop and a differentiator over HPCC.
- Open source acceptance – Gaining acceptance by the open source community is not as easy as just joining the Linux Foundation, as HPCC has done. You have to put in your time, not to mention your code. HPCC Systems has made HPCC open source, but LexisNexis “will not release its data sources, data products, the unique data linking technology, or any of the linking applications that are built into its products. These assets will remain proprietary and will not be released as open source.” This will not endear HPCC to the open source community. In contrast, Hadoop has been open source for virtually its entire existence and has a hardcore following of dedicated contributors. In order for HPCC to benefit from the open source model, it needs to attract talented developers to contribute new, innovative features. It remains to be seen if the open source community will embrace HPCC.
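To make the real-time point in the list above concrete, here is a minimal Python sketch of the "batch first, fast queries second" pattern as I understand it. It is a toy model, not HPCC code: the `BatchIndex` class and its `ingest`, `run_batch` and `query` methods are hypothetical names invented for illustration, and the only thing it assumes from the description above is that queries are answered from data a batch step has already processed.

```python
from collections import defaultdict


class BatchIndex:
    """Toy model (hypothetical, not an HPCC API): a batch job builds a
    snapshot, and a fast query layer answers lookups from that snapshot."""

    def __init__(self):
        self.raw_records = []  # data as it arrives
        self.index = {}        # snapshot produced by the last batch run

    def ingest(self, customer, amount):
        # New data lands in raw storage only; the snapshot is untouched.
        self.raw_records.append((customer, amount))

    def run_batch(self):
        # Batch step (the role the article attributes to Thor):
        # rebuild the snapshot from all raw records.
        totals = defaultdict(float)
        for customer, amount in self.raw_records:
            totals[customer] += amount
        self.index = dict(totals)

    def query(self, customer):
        # Serving step (the role the article attributes to Roxie):
        # a fast lookup that sees only the last snapshot.
        return self.index.get(customer, 0.0)


store = BatchIndex()
store.ingest("acme", 100.0)
store.run_batch()
before = store.query("acme")  # reflects the first batch run

store.ingest("acme", 50.0)
stale = store.query("acme")   # fast answer, but the new record is not visible yet

store.run_batch()
fresh = store.query("acme")   # visible only after the next batch run

print(before, stale, fresh)
```

The second query returns quickly but still reports the old total, which is the distinction drawn above: low query latency is not the same thing as analytics on data that is updated in real time.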
This is just some initial analysis. If and when HPCC demonstrates some proof points and successful use cases, the balance of power could change. We’ll just have to wait and see.
In terms of the Big Data big picture, HPCC creates another Big Data “fork.” That is, Big Data technologies are still in the development phase, and it is unclear which approach, including competing approaches within the Hadoop framework via commercial distributions from EMC and Cloudera, will eventually win out. The entry of LexisNexis adds another competitor to the picture, potentially lengthening the amount of time it will take for a particular Big Data approach to win out. This, understandably, makes companies that are interested in Big Data reluctant to choose one approach over the other until a dominant approach emerges. Nobody wants to get stuck with an expensive Betamax when everyone else is using VHS.
The benefits of increased competition in the Big Data space will, I think, outweigh the negatives, specifically a lengthy battle for supremacy. Increased competition will spur more innovation in less time than if Hadoop had no worthy foe. Any lag time created by a drawn-out Big Data war will be offset by the superior innovation it will likely lead to.
For companies that want to get started with Big Data and have the internal expertise to do so, I recommend experimenting with the community editions of both HPCC and Hadoop. I wouldn’t make any investments in commercial versions of either technology until you’ve tried both out and thoroughly vetted them for your specific use cases. Even then, proceed cautiously, as it will be some time before the winner in the Big Data competition becomes clear.
Those companies that lack the internal resources to take advantage of Big Data now should still get engaged. Start thinking about how Big Data could help your business, whether by improving operational efficiency, identifying new revenue opportunities, or in any number of other ways. Follow the developments in the Big Data space, reach out to companies with similar needs that are using Big Data technologies, and learn from their experiences. That way, when a dominant Big Data approach emerges and the technology becomes truly enterprise-ready, you won’t get caught flat-footed.