Hadoop is often touted as an inexpensive, scalable approach to Big Data Analytics. But even with commodity hardware, the expenses can start to add up.
More to the point, most Hadoop deployments don’t make very good use of the servers that populate their clusters. This isn’t such a big deal in small proof-of-concept (PoC) deployments, but when PoCs transition to large-scale production deployments with hundreds or thousands of nodes, enterprises start to feel the effects of server inefficiency where it hurts – the wallet.
In this sense, Hadoop is no different than any other IT project – you want to get your money’s worth and that means squeezing the most efficiency from your hardware. That’s no easy task in Hadoop environments, where overall hardware utilization often struggles to reach the 20% mark.
This sure sounds like a scenario tailor-made for virtualization, and indeed VMware recently rolled out what it calls Project Serengeti, which allows enterprises to run multiple Hadoop clusters (running multiple Hadoop distributions) on virtual infrastructures. The goal is to get the most productivity out of each node while giving administrators the ability to spin up new Hadoop clusters in just minutes, according to VMware.
In addition to quickly bringing Hadoop clusters online when needed and offline again when they’ve outlived their usefulness, Serengeti also allows users to deploy the open source framework on existing infrastructure and provides automatic recovery capabilities, according to the company.
There is much to like here. As my colleague David Floyer wrote just last month, “[Serengeti] users should expect significant savings from more efficient use of compute resources, faster provisioning, and sharing servers across multiple workloads, and from the use of a familiar management platform.”
But there’s a catch. To take advantage of Serengeti, each node in a Hadoop cluster must run vSphere Enterprise Edition, which adds significantly to the overall cost. Again, as Floyer points out, virtualization overhead can be extensive in Big Data scenarios, meaning users will be forced to pay the “virtualization tax.” This means small “sandbox-style” Hadoop deployments on Serengeti will probably make financial sense, but the numbers may not work in large-scale deployments.
Rush for Hadoop
But there’s more than one way to skin the Hadoop inefficiency cat. Data integration vendor Pervasive Software has taken a different approach, developing a general-purpose software framework it calls DataRush. The Java-based framework automatically parallelizes jobs built on the platform, running on vanilla JVMs and standard commodity hardware, according to David Inbar, Pervasive’s Senior Director of Business Development and Strategy. A related offering, called RushAnalyzer, significantly speeds up Big Data pre-processing, Inbar said.
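To make the "parallelism on a vanilla JVM" idea concrete: this is not DataRush's actual API (which Pervasive does not document publicly here), but a minimal sketch of the general technique using only the standard library's parallel streams. The `cleanse` step and the sample records are hypothetical stand-ins for a real pre-processing job.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ParallelSketch {
    // Hypothetical record-cleansing step standing in for a real
    // pre-processing operation (trim whitespace, normalize case).
    static String cleanse(String raw) {
        return raw.trim().toLowerCase();
    }

    public static void main(String[] args) {
        List<String> records = List.of("  Alpha", "BETA ", " Gamma  ");

        // A parallel stream fans the map step out across all available
        // cores on a stock JVM -- no cluster or custom scheduler needed.
        List<String> cleansed = records.parallelStream()
                .map(ParallelSketch::cleanse)
                .collect(Collectors.toList());

        // collect() preserves the encounter order of the source list
        // even though the work ran in parallel.
        System.out.println(cleansed);
    }
}
```

The point of frameworks in this space is to let developers write the per-record logic once and have the runtime handle the fan-out and ordering, rather than hand-managing threads.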
Pervasive initially used the technologies inside its suite of data management products but has since made the two products directly available to customers. While Inbar wouldn’t reveal specific pricing, he said Pervasive’s subscription license costs “a tenth to a fifth” of what SAS or SPSS customers typically pay for similar functionality.
Thus far, DataRush and RushAnalyzer have garnered significant attention from systems integrators and consultancies, including Opera Solutions, Inbar said.
Action Item: CIOs and data center administrators cannot afford to overlook hardware inefficiency in Big Data scenarios. Hardware inefficiency becomes a significant cost driver when Hadoop deployments are rolled out into large-scale production. While inexpensive in relative terms, large Hadoop clusters that span thousands of nodes can still cost hundreds of thousands of dollars or more. Whether through hardware virtualization or a software-centric approach, deriving the most value from each node becomes paramount. The key is that the cost of improving hardware efficiency must be outweighed by the reduced cost of the hardware itself (either achieving the same performance on fewer nodes or increasing performance on the same number of nodes) and/or by related gains in analytics productivity.
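The break-even logic in the Action Item can be sketched as simple arithmetic. All figures below are hypothetical, chosen only to illustrate the shape of the calculation; the ~20% utilization figure echoes the one cited earlier in the article, and the per-node costs are illustrative assumptions, not vendor pricing.

```java
public class BreakEvenSketch {
    public static void main(String[] args) {
        // Hypothetical inputs -- for illustration only.
        int nodes = 1000;                      // current cluster size
        double costPerNode = 4000.0;           // commodity server cost, USD
        double baselineUtil = 0.20;            // ~20% utilization, as cited
        double improvedUtil = 0.60;            // assumed post-optimization level
        double efficiencySpendPerNode = 500.0; // assumed tooling/licensing cost

        // The same useful work needs fewer nodes at higher utilization.
        int nodesNeeded = (int) Math.ceil(nodes * baselineUtil / improvedUtil);

        double hardwareSavings = (nodes - nodesNeeded) * costPerNode;
        double efficiencyCost = nodesNeeded * efficiencySpendPerNode;

        System.out.printf("Nodes needed:     %d%n", nodesNeeded);
        System.out.printf("Hardware savings: $%.0f%n", hardwareSavings);
        System.out.printf("Efficiency spend: $%.0f%n", efficiencyCost);
        System.out.printf("Net benefit:      $%.0f%n",
                hardwareSavings - efficiencyCost);
    }
}
```

Under these assumptions the efficiency spend is recovered many times over by the smaller hardware footprint; the decision flips only when the per-node efficiency cost (e.g., a "virtualization tax") approaches the per-node savings.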