Data deduplication is an increasingly important aspect of storage technology

Although new products have shipped continuously over the past two decades, the world of storage advancement has remained relatively stagnant, at least from a performance perspective.  According to PCWorld’s 50 Years of Hard Drives, the first 10,000 RPM disk was released in 1996 and the first 15,000 RPM disk in 2000.  Since then, storage companies have focused on density and capacity rather than on performance, leading to the need for an ever-increasing number of spindles (individual spinning disks spread across arrays) in order to improve overall storage performance.  As a result of this eager march toward density, the primary metric by which storage has been measured is a function of capacity: dollars per gigabyte or dollars per terabyte, for example.

Although capacity has played a central role, storage performance can’t be overlooked, particularly as organizations centralize more and more technology through various virtualization initiatives.  These initiatives, however, have created some new challenges and opportunities:

  • New performance challenges.  As more workloads are centralized, particularly workloads that are directly user-facing, such as virtual desktops, storage performance becomes an increasingly critical factor.  While server workloads might be able to “hide” behind lesser-performing storage, once workloads are exposed directly to users, performance matters even more than it already did.  VDI, in particular, places new stress on storage: it introduces occasional massive spikes in demand as users all boot their virtual desktops simultaneously, a phenomenon known as a boot storm.
  • More homogeneous workloads in some cases.  The number of workloads in the data center has burgeoned.  Between servers and desktops, there’s a lot more in the data center than there used to be.  This has created a situation in which many different workloads look very similar; all of the VDI-based machines run the same operating system, for example.

Solving these performance challenges is hard work and can be expensive.  Companies have to build their virtual environments around very high performance standards while, at the same time, meeting the capacity requirements their business objectives demand.  As mentioned earlier, this can mean buying a large quantity of spindles just to meet basic needs.
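To see why, consider a rough sizing exercise.  The per-disk IOPS figures below are common rules of thumb rather than vendor specifications, and the 50,000 IOPS target is simply an illustrative boot-storm peak, but the arithmetic shows how quickly spindle counts climb when performance, not capacity, is the constraint.

```python
import math

# Commonly cited rule-of-thumb IOPS per spindle; real figures vary by workload.
IOPS_PER_DISK = {
    "7.2K RPM SATA": 80,
    "10K RPM SAS": 140,
    "15K RPM SAS": 180,
}

def spindles_needed(target_iops: int, disk_type: str) -> int:
    """Number of spindles required to satisfy a raw IOPS target."""
    return math.ceil(target_iops / IOPS_PER_DISK[disk_type])

# Illustrative target: a VDI boot storm peaking at 50,000 IOPS.
TARGET_IOPS = 50_000
for disk in IOPS_PER_DISK:
    print(f"{disk}: {spindles_needed(TARGET_IOPS, disk)} spindles "
          f"to reach {TARGET_IOPS:,} IOPS")
```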

What’s a storage architect to do?

Flash storage would seem to be the logical successor to the current lineup of spindle-based disks.  From a business perspective, the primary issue with flash-based storage is cost.  From this article at Pingdom, we learn that, in 2011, the average cost per gigabyte for solid state storage was $2.42, versus $0.075 per gigabyte for traditional magnetic media.  This cost differential is incredible… and not in a good way.
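To put a number on it, a quick back-of-the-envelope calculation using those 2011 Pingdom figures shows a gap of roughly 32 to 1:

```python
# Back-of-the-envelope comparison using the 2011 figures quoted above.
ssd_cost_per_gb = 2.42    # USD per GB, solid state (2011 average)
hdd_cost_per_gb = 0.075   # USD per GB, magnetic disk (2011 average)

ratio = ssd_cost_per_gb / hdd_cost_per_gb
print(f"Flash costs roughly {ratio:.0f}x more per gigabyte than spinning disk")
```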

Next, solid state disks don’t have nearly the capacity of their rotating cousins.  With spinning disks, 3 TB drives are readily available; with solid state disks, the largest drives are measured in the hundreds of gigabytes.  This creates capacity challenges.  But all is not lost.

Some companies, such as Pure Storage, are embracing the solid state disk trend and using it as the primary building block for a new class of storage arrays that solve the performance problem while leveraging the centralization of similar workloads.  Knowing that businesses will balk at the price of raw SSD storage, Pure Storage doesn’t offer data deduplication as an optional feature.  Deduplication is an integral, standard building block of the product, and it allows the company to compete on price on more equal footing with traditional vendors.  The data deduplication feature in the Pure Storage FlashArray doesn’t compromise on performance in order to achieve capacity efficiency.  It is one of the company’s main selling points and allows Pure Storage to claim that its solution carries a cost per gigabyte similar to that of traditional storage solutions.
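Pure Storage doesn’t publish its implementation details here, but the core idea behind block-level deduplication is simple enough to sketch: hash each incoming block, store a block only if its content hasn’t been seen before, and keep a reference for every duplicate.  The following is a conceptual illustration only (fixed-size blocks, no compression, no hash-collision verification), not a description of the FlashArray’s internals:

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size blocks for simplicity

class DedupStore:
    """Minimal content-addressed block store illustrating deduplication."""

    def __init__(self):
        self.blocks = {}   # hash -> block data (unique blocks only)
        self.volume = []   # ordered list of block hashes ("the volume")

    def write(self, data: bytes) -> None:
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if this content hasn't been seen before.
            self.blocks.setdefault(digest, block)
            self.volume.append(digest)

    def read(self) -> bytes:
        return b"".join(self.blocks[h] for h in self.volume)

    def reduction_ratio(self) -> float:
        logical = len(self.volume) * BLOCK_SIZE
        physical = sum(len(b) for b in self.blocks.values())
        return logical / physical if physical else 0.0

# A hundred "virtual desktops" writing the same golden image dedupe to one copy.
os_image = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(256))
store = DedupStore()
for _ in range(100):
    store.write(os_image)
print(f"Reduction ratio: {store.reduction_ratio():.0f}:1")  # roughly 100:1
```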

Let’s see how the company positions this.

The Pure Storage FlashArray is an array with a base raw capacity of 11 TB.  This is where storage pros might look at the solution and walk away, since most organizations need far more space.  Because of Pure Storage’s extremely efficient data deduplication technology, however, the company advertises its array a little differently and includes a second metric called effective capacity.  For the 11 TB array, the company’s effective capacity figure is “up to 100 TB.”  Given the homogeneous nature of many workloads (think VDI), this is not an unreasonable expectation.
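The math behind the two numbers is straightforward: effective capacity is raw capacity multiplied by the data reduction ratio actually achieved, and the effective cost per gigabyte shrinks by the same factor.  In the sketch below, the array price is a made-up placeholder used purely for illustration, not a quoted figure:

```python
RAW_TB = 11                        # FlashArray base raw capacity
HYPOTHETICAL_PRICE_USD = 300_000   # placeholder price, illustration only

def effective_capacity_tb(raw_tb: float, reduction_ratio: float) -> float:
    """Effective capacity is raw capacity scaled by the achieved reduction ratio."""
    return raw_tb * reduction_ratio

for ratio in (3, 5, 9):            # roughly 9:1 turns 11 TB into ~100 TB
    eff_tb = effective_capacity_tb(RAW_TB, ratio)
    eff_cost_gb = HYPOTHETICAL_PRICE_USD / (eff_tb * 1000)
    print(f"{ratio}:1 reduction -> {eff_tb:.0f} TB effective, "
          f"${eff_cost_gb:.2f} per effective GB")
```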

In addition, the array carries incredible performance specifications, providing 300,000 IOPS and 180,000 sustained write IOPS.  Obviously, the effective capacity will vary based on actual usage, but Pure Storage provides a tool called the Purity Reduction Estimator (PRE) Tool, which can be used to predict data reduction rates.
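I don’t have visibility into how the PRE Tool works internally, but the general technique for estimating a reduction rate is to scan a representative sample of data, hash its blocks, and compare unique blocks to total blocks.  Here is a conceptual sketch of that idea; it is not the PRE Tool itself, and the sample path in the usage comment is hypothetical:

```python
import hashlib
from pathlib import Path

BLOCK_SIZE = 4096  # fixed-size blocks for simplicity

def estimate_reduction(paths: list[Path]) -> float:
    """Estimate a deduplication ratio by hashing fixed-size blocks of a data sample."""
    total_blocks = 0
    unique_hashes = set()
    for path in paths:
        with open(path, "rb") as f:
            while block := f.read(BLOCK_SIZE):
                total_blocks += 1
                unique_hashes.add(hashlib.sha256(block).digest())
    return total_blocks / len(unique_hashes) if unique_hashes else 1.0

# Hypothetical usage: scan a sample of the data you intend to migrate.
# ratio = estimate_reduction(list(Path("/data/vdi-images").glob("*.vmdk")))
# print(f"Estimated reduction: {ratio:.1f}:1")
```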

All of this takes massive CPU power to accomplish.  Processing performance is a metric that has kept pace with Moore’s Law in a big way, and today’s processors are dramatically overpowered for many of the tasks sent their way.  By leveraging this processing abundance and capitalizing on new trends in data center workloads, upstart vendors such as Pure Storage are well positioned to make inroads into a storage market that has long been focused on capacity and spindle-driven performance.

Personally, I see good things for companies like Pure Storage.  These newcomers are turning the storage paradigm upside down, focusing on performance and using today’s massive processing power to solve the capacity question through revolutionary deduplication techniques.  This is an area that deserves continued attention.