Don’t Let Misconceptions Limit Your Hadoop Options

Developers don’t get spooked easily. But in a market like Big Data, dominated by open source code and free software, the term ‘proprietary’ can send shivers down a developer’s spine.

The fear (or is it disdain?) is sometimes justified. No developer wants to get locked into a platform that dictates which tools she can use, which data sources she can integrate, or which hardware she must deploy, or that makes switching to a competing platform too costly to justify.

But ‘proprietary’ doesn’t necessarily mean these things, and in fact the term itself can obscure the more nuanced reality. That is certainly the case with Apache Hadoop, the open source Big Data framework that serves as the foundation of many Big Data implementations. This confusion is preventing some enterprises from considering all of their Hadoop distribution options, potentially leaving significant business value and associated revenue on the table.

Eyes Wide Shut

As is well known, there are three Silicon Valley start-ups vying to win the commercial Hadoop market, each with its own spin on the open source framework. Hortonworks’ HDP, most everyone agrees, is a fully open source product. Cloudera’s CDH is also considered open source by most, though the enterprise edition of its management software, Cloudera Manager, is proprietary.

Then there’s MapR’s M5. It includes support for most of the popular open source components of the Hadoop stack, including Hive, HBase and Flume. It is 100% API compatible with Apache Hadoop, meaning data can be relatively easily moved in and out of M5 compared to CDH and HDP, and developers can build applications atop the platform in either C or Java. Like Cloudera Manager Enterprise Edition, M5 includes proprietary cluster management software. But unlike Hortonworks and Cloudera, MapR supplements HDFS, the main storage layer in Apache Hadoop, with its own proprietary version of NFS, which it says delivers significant performance gains.

M5 often gets tagged with the label ‘proprietary’ due to this last feature, NFS. As a result, the vendor is sometimes left off developers’ shortlists. This is a shame, not because M5 is necessarily the ‘best’ or ‘most advanced’ Hadoop distribution, but because the exclusion is often based on a misunderstanding of the term ‘proprietary.’

Specifically in MapR’s case, the ‘proprietary’ label leads some developers to believe that M5 is a black box that restricts the types of application development tools it supports and/or makes it more difficult to migrate data out of the platform relative to competing Hadoop distributions. This is simply not the case, but it means developers sometimes don’t bother to include M5 in PoCs. And in an immature but developing market like Hadoop, developers who don’t honestly weigh all of their options are doing themselves and their enterprises a disservice.

Lose the Labels

Just like people, most Hadoop distributions aren’t all one thing or all another. Sure, Uncle Bob enjoys gardening on the weekends, but it wouldn’t be accurate to label him a horticulturist. Likewise, yes, MapR employs a closed-source storage layer in its enterprise Hadoop distribution, but that doesn’t mean it is proprietary from stem to stern. And what about CDH, specifically when customers bundle in Cloudera Manager Enterprise Edition? Does that mean CDH should be labeled ‘proprietary’?

The point is it doesn’t matter what you label them. In fact, developers evaluating Hadoop shouldn’t concern themselves with labels at all. What should matter to developers are the pros and cons of each distribution relative to the others, with added weight lent to those attributes and features most relevant to their particular circumstances.
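One way to make that kind of weighting concrete is a simple weighted scoring matrix. The sketch below is only an illustration: the criteria names, the weights, the 1-to-5 scores and the candidate names `distribution_a` and `distribution_b` are all hypothetical placeholders, not measurements of any real Hadoop distribution. Swap in your own criteria and the results of your own PoCs.

```python
# Hypothetical criteria and weights; adjust them to your own circumstances.
# The weights here sum to 1.0 so the final score stays on the 1-5 scale.
WEIGHTS = {
    "performance": 0.4,
    "source_access": 0.2,
    "tooling": 0.2,
    "migration_ease": 0.2,
}

# Placeholder 1-5 scores for two made-up candidates.
# These are NOT assessments of any real product.
SCORES = {
    "distribution_a": {"performance": 5, "source_access": 2, "tooling": 4, "migration_ease": 4},
    "distribution_b": {"performance": 3, "source_access": 5, "tooling": 4, "migration_ease": 4},
}


def weighted_score(scores, weights):
    """Return the sum of each criterion's score multiplied by its weight."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank(candidates, weights):
    """Return candidate names ordered from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank(SCORES, WEIGHTS):
        print(name, round(weighted_score(SCORES[name], WEIGHTS), 2))
```

Note that the ranking depends entirely on the weights: an enterprise that values access to the storage-layer source code above raw performance would shift weight to `source_access` and could well see the order flip. That is the point of scoring against your own circumstances rather than against a label.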

That means enterprises that have been experimenting with Apache Hadoop for some time but have hit performance walls, for example, might decide the value of M5’s performance capabilities outweighs the inability to alter code at the storage layer (especially at enterprises that lack the internal HDFS expertise to do so anyway). Conversely, developers who want to build their own version of HDFS based on the open source foundation, and therefore require access to the code, will likely consider M5’s NFS a drawback.

We could go on with more examples and scenarios, but the important takeaway here is that developers should not dismiss a potentially valuable technology or platform just because others have labeled it ‘proprietary’ (or given it any other label, for that matter). Rather, developers should take the time and perform the due diligence necessary to accurately compare and contrast all viable, competing Hadoop distribution options. In some cases M5 may come out on top. In other cases, CDH or HDP may wind up the best choice. The point is that to limit one’s options due to ignorance or misconceptions is never the right choice.




  • Propreitary

    Jeff, would you not consider it an ignorant misconception to choose a closed-source filesystem over one that’s openly developed with orders of magnitude more developer horsepower behind it?

  • Jeff

    No, I would consider that a choice each developer must make based on their circumstances. It would only be an ignorant choice if he/she didn’t have all the information available regarding Hadoop options when making a decision. Yours is a valid argument, but all I’m saying is get the facts straight and then make the best decision for your enterprise.