LexisNexis HPCC Takes On Hadoop as Battle for Big Data Supremacy Heats Up

Over the last ten years, LexisNexis Risk Solutions has developed what it calls a High Performance Computing Cluster (HPCC), a proprietary platform for processing and analyzing large volumes of data for its clients in finance, utilities and government. The company this week made HPCC open source and spun off HPCC Systems to develop and market the technology.

LexisNexis is positioning HPCC as a competitor to Apache Hadoop, the open source software framework for Big Data processing and analytics. The entry of LexisNexis and HPCC into the Big Data ecosystem is yet another validation of the Big Data space and should spur innovation from all parties – HPCC, Hadoop and others.

Whether HPCC is a viable competitor to Hadoop for Big Data dominance is another question. LexisNexis, which has vast experience in collecting and processing large volumes of media and industry data, certainly thinks it is. The answer, of course, depends on a number of factors, most of which are not yet clear. Here is my initial analysis:

  • Maturity – Ten years in the making, HPCC has roughly a three-year head start on Hadoop, whose development began, more or less, in 2004. Since then, however, Hadoop has benefited from the contributions of thousands of developers via the Apache Software Foundation, while HPCC was developed behind closed doors by an undetermined number of LexisNexis developers. If you subscribe to the notion that two heads are better than one, Hadoop is likely the more mature technology thanks to its open source heritage, despite being a few years younger than HPCC.
  • Programming language – While Hadoop is, overall, the more mature of the two Big Data technologies, HPCC may have the edge in some specific functional areas. LexisNexis claims HPCC’s programming language, Enterprise Control Language (ECL), enables “data analysts and developers to define ‘what’ they want to do with their data instead of giving the system step-by-step instructions.” Though it doesn’t specifically call out Pig Latin, one of the most popular high-level languages for writing Hadoop jobs, LexisNexis is implying that ECL is the easier, faster of the two languages in which to create Big Data processing jobs (see the first sketch after this list). If true, this could be an advantage for HPCC over Hadoop, whose main drawback is that it requires significant expertise to use.
  • Real-time data analytics – HPCC’s Rapid Data Delivery Engine, a.k.a. Roxie, allows users to run real-time queries against HPCC. This appears to be an advantage over Hadoop, which is generally batch-oriented and used for “rear-view mirror” analysis. On closer inspection, however, while Roxie can return query results in under a second, the data it hits has already passed through Thor, HPCC’s batch data processing cluster (see the second sketch after this list). So the queries run in near real-time, but the data itself is not updated in real-time. In other words, HPCC is not capable of true real-time data analytics, as far as I can tell, so on this point it’s a wash. Facebook is working on some interesting real-time analytics jobs using Hadoop, however, which, if applicable to other use cases, could be a significant improvement for Hadoop and a differentiator over HPCC.
  • Open source acceptance – Gaining acceptance by the open source community is not as easy as just joining the Linux Foundation, as HPCC has done. You have to put in your time, not to mention your code. HPCC Systems has made HPCC open source, but LexisNexis “will not release its data sources, data products, the unique data linking technology, or any of the linking applications that are built into its products. These assets will remain proprietary and will not be released as open source.” This will not endear HPCC to the open source community. In contrast, Hadoop has been open source for virtually its entire existence and has a hardcore following of dedicated contributors. In order for HPCC to benefit from the open source model, it needs to attract talented developers to contribute new, innovative features. It remains to be seen if the open source community will embrace HPCC.
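
To make the “what, not how” claim in the programming-language bullet concrete, here is a minimal ECL sketch. The record layout, logical file name and field names are hypothetical examples of my own, not anything that ships with HPCC; the point is that you declare the result you want and let the platform plan the cluster work:

    // Minimal ECL sketch. The layout, logical file name and fields
    // below are hypothetical examples, not part of the HPCC platform.
    PersonRec := RECORD
        STRING25  FirstName;
        STRING25  LastName;
        UNSIGNED1 Age;
    END;

    // Point at a logical file already loaded onto the Thor cluster
    People := DATASET('~demo::people', PersonRec, THOR);

    // Declare WHAT we want; the platform decides HOW to filter,
    // distribute and sort it in parallel across the nodes
    Adults     := People(Age >= 18);
    ByLastName := SORT(Adults, LastName);

    OUTPUT(ByLastName);

A hand-written MapReduce job would spell out the map, shuffle and reduce steps explicitly; closing that gap is exactly what both ECL and Pig Latin aim to do.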

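To illustrate the Thor-then-Roxie division of labor from the real-time bullet, here is a second sketch, again with hypothetical file and field names. The index is built by a batch Thor job; a Roxie query can then serve sub-second lookups against it, but the answers are only as fresh as the last build:

    // Hypothetical example; in practice the build and the query
    // would run as two separate jobs.
    PersonRec := RECORD
        STRING25 FirstName;
        STRING25 LastName;
    END;

    People := DATASET('~demo::people', PersonRec, THOR);

    // Batch step, run on Thor: build an index keyed by last name
    PersonKey := INDEX(People, {LastName}, {FirstName},
                       '~demo::key::person_by_lastname');
    BUILD(PersonKey);

    // Query step, published to Roxie: the STORED value is supplied
    // by the caller at run time
    STRING25 searchName := '' : STORED('LastName');
    OUTPUT(PersonKey(KEYED(LastName = searchName)));

The trade-off is visible in the sketch: Roxie is fast precisely because Thor has already done the heavy lifting, which is also why its results can lag the source data.
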
This is just some initial analysis. Once (if) HPCC demonstrates some proof points and successful use cases, the balance of power could change. We’ll just have to wait and see.

In terms of the Big Data big picture, HPCC creates another Big Data “fork.” That is, Big Data technologies are still in the development phase, and it is unclear which approach, including competing approaches within the Hadoop framework via commercial distributions from EMC and Cloudera, will eventually win out. The entry of LexisNexis adds another competitor to the picture, potentially lengthening the time it will take for a winner to emerge. This, understandably, makes companies that are interested in Big Data reluctant to commit to one approach until a dominant one emerges. Nobody wants to get stuck with an expensive Betamax when everyone else is using VHS.

The benefits of increased competition in the Big Data space will, I think, outweigh the main negative: a lengthy battle for supremacy. Increased competition will spur more innovation in less time than if Hadoop had no worthy foe. Any lag time created by a drawn-out Big Data war will be offset by the superior innovation it is likely to produce.

For companies that want to get started with Big Data and have the internal expertise to do so, I recommend experimenting with the community editions of both HPCC and Hadoop. I wouldn’t make any investments in commercial versions of either technology until you’ve tried both out and thoroughly vetted them for your specific use cases. Even then, proceed cautiously, as it will be some time before the winner in the Big Data competition becomes clear.

Those companies that lack the internal resources to take advantage of Big Data now should still get engaged. Start thinking about how Big Data could help your business, whether by improving operational efficiency, identifying new revenue opportunities, or in any number of other ways. Follow developments in the Big Data space, reach out to companies with similar needs that are already using Big Data technologies, and learn from their experiences. That way, when a dominant Big Data approach emerges and the technology becomes truly enterprise-ready, you won’t get caught flat-footed.


  • Christopher Albee

    Jeff has made some solid and insightful points concerning HPCC. Speaking strictly for myself, I suspect strongly that the different strengths inherent in both Hadoop and HPCC will eventually reveal which Big Data solutions each technology is best suited for.
    Speaking more as a member of the HPCC community, I can add to Jeff’s thoughts:
    • Concerning maturity: HPCC is a mature technology having no rival for handling Big Data *in accordance with the business needs of LexisNexis Risk Solutions*. We recognize that Hadoop is a great technology and has benefited from the efforts of its community members by virtue of having become open source. We hope to see similar innovation in HPCC.
    • Concerning the programming language: ECL ain’t for sissies. Since it is a declarative language, it requires a different mindset, and someone who wants to use it without such experience must commit to a bit of a learning curve. I’ve programmed in ECL for five years, and I can say truthfully that the rewards are great: it is a mature, concise and fully featured language. And, because it was developed specifically to work with Big Data, it allows the developer to think about the data rather than the CPU.
    • Concerning real-time data analytics: By and large, Jeff is absolutely correct. However, the capability exists in the technology to query one or more real-time data sources. Notwithstanding, based on my work with the technology, I would assert that the apparent lack of real-time analytics contributes to the speed with which HPCC has proven its effectiveness. True, all queryable data is ingested through Thor first, but this is what makes Roxie so fast. Like so many other things in IT, it’s a trade-off.
    • Concerning open-source acceptance: Our proprietary assets are just that: proprietary. Can’t put them out there. But there is some work being done in our off-time with publicly available (non-proprietary) sources to demonstrate the utility of HPCC and provide code samples whose style one would easily find in our repositories. I’ve contributed one such sample; we’ll see if it makes it out to the hpccsystems.com Contributions page. :-)

  • Jeff Kelly

    Hi Christopher. Great insights. Glad to hear LNRS will be contributing code samples, as I think that’s key to the open source model. I also take your point that HPCC is in fact a mature technology in the LN context. All in all, I think another Big Data approach is a good thing for the ecosystem as a whole.

  • http://twitter.com/Neuromancer Maurice Walshe

    Yep, I need an O’Reilly book on ECL (well, when I can get some time to try to get my head round it).


  • hpcc hater

    Don’t buy into the LexisNexis propaganda. I use HPCC and ECL every day. I also use Hadoop every day. HPCC sucks. It’s horrific. The entire team loathes it. ECL is cryptic, difficult to understand, poorly documented and buggy. Why they didn’t just stay with a SQL-based language is beyond me. The platform has tons of bugs throughout. The back end is primarily single-threaded, causing jobs to back up for unnecessary lengths of time. It’s not robust at all. If one node crashes, the whole thing crashes. The only folks that have it working well are the jackasses in Boca. They have no problem coming out to fix your platform for a couple million dollars. I wish this platform would just die already.

    Don’t waste your time. If you want to do real Big Data, stick with Hadoop. It’s awesome.

  • Christopher Albee

    I understand your frustration. HPCC as a technology requires a lot of care on the back end with respect to hardware. In a multi-node supercomputer, nodes fail. It happens. So you swap out the node and keep going. You bring up an important point, though: to use the HPCC technology effectively you need some fairly bright bulbs on your staff, from ECL programmer to network admin. The technology presents some challenges that require a bit of extra effort to overcome. It sounds like you and your engineering staff either had the technology forced on you, or you decided to try it expecting it to work like any other RDBMS; in either case I understand why you are venting.

    But with respect to LexisNexis propaganda, I have to take issue: I’m not a propagandist, I’m a software engineer. I’ve worked with many technologies over the last 15 years, and I’m not so overcome with how great HPCC is that I don’t recognize its problems and issues. I had to code around one such issue this week. Notwithstanding, it’s a viable technology in a market that has plenty of room for different Big Data solutions.

    Regarding your comment about the ECL language being poorly documented and buggy: please see http://hpccsystems.com/download/docs for tons of well-written and complete documentation. And lastly, your assertion that ECL is cryptic and difficult to understand: see my earlier comment about sissies.