In a recent blog post entitled: Ocarina Weighs in- Dedupe Ratios Do Matter, VP Carter George writes: The dedupe ratio measures against what’s left after you’ve deduped. The percentage measures against the size of the data before you dedupe. Both are valid measures. It’s also true that some solutions do a better job shrinking your data than others. Dedupe solutions that do a better job should be ranked higher when you are comparing solutions. That said, comparing the claims made on vendor websites is not a very good way to find out who can actually shrink your data better.
Why? Vendors lie.
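The two measures Carter describes are tied together by simple arithmetic, and it helps to see how quickly the ratio inflates relative to the percentage. The snippet below is a minimal sketch of that conversion; the function names are mine, for illustration only, and are not from Carter's post.

```python
def ratio_to_percent(ratio: float) -> float:
    """Convert a dedupe ratio (original size : size after dedupe) to percent saved."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1 (1:1 means no reduction)")
    return (1 - 1 / ratio) * 100


def percent_to_ratio(percent: float) -> float:
    """Convert percent saved (measured against the original size) to a ratio."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range [0, 100)")
    return 100 / (100 - percent)


# Doubling the ratio from 10:1 to 20:1 adds only five points of savings.
for ratio in (2, 5, 10, 20):
    print(f"{ratio}:1 -> {ratio_to_percent(ratio):.1f}% saved")
```

A 10:1 ratio and a 20:1 ratio sound very different, but they correspond to 90% and 95% saved, which is one reason the headline ratio alone says little about value.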
I agree with Carter that dedupe, compression, or capacity optimization solutions that do a better job should be ranked higher. I also agree that vendors like to hype dedupe optimization ratios. The question is how to determine which solutions do a “better job.” I contend that the dedupe optimization ratio is not the best measure.
For example, I’ll bet that on balance, despite the fact that Data Domain hypes dedupe ratios, its dedupe rates are higher than Ocarina’s. So should Data Domain be ranked higher? Not necessarily – because Data Domain is used in a different use case than Ocarina. Data Domain is all about backup where the opportunities to reduce capacity are greater than in primary storage or archiving use cases, generally.
When I applied this thinking to primary storage I realized capacity optimization technologies for primary storage are not created equal. Wikibon research suggests that inline compression is far more efficient and effective than alternatives for primary storage. What does that mean? It means that for the cost of the solution you get way more bang for the buck, if applied in the right use cases. By ‘way’ I don’t mean a little bit. I mean as much as two orders of magnitude better. Maybe this is completely intuitive to you but it blew me away when I realized what was happening in this technology space.
How did I come to this conclusion? The answer is CORE.
Update – 4-21-10:
I’ve received lots of feedback on CORE from a variety of sources that I respect. Here’s a summary:
*You need to include decompress rates as well.
*CORE too heavily weighs performance and as such is seriously flawed.
*The piece is seriously biased toward Storwize – its competitive advantages are not as high as CORE suggests.
*You need to do a full TCO and include operational costs.
My take – including decompress rates might change the conclusions, especially for ASIS, so we should do that.
2 & 3 are closely related imo. Hard to believe any company has that much of a competitive advantage, isn’t it? But if you could embed something like Storwize into an array (a la ASIS), make it super fast and invisible…I think that would potentially change the game. I haven’t concluded (yet) that performance is less critical than the CORE calculation implies…but I’m very open to improving the calculations. I think that should be the area of discussion. To me…#3 above says that performance is not as important as CORE implies. My fundamental question is “why not?” If time is money it would stand to reason that, in primary storage anyway, performance is absolutely critical. Where I admittedly struggle with this is a) ASIS is ‘free’ and b) what Ocarina does with different file types is really interesting and clearly adds value. The fact that you can do both during a batch window should somehow be factored.
On full TCO – I agree – that would be useful.
Background – CORE
Wikibon has developed a model to evaluate the effectiveness of data reduction technologies.
The model uses the concept of CORE, which stands for Capacity Optimization Ratio Effectiveness. It is a measure of the effectiveness of a storage optimization technology as a function of time and cost to achieve a desired capacity reduction. The bottom line goal was summarized by Wikibon member Mike Davis who reviewed the methodology and said:
…the goal of your proposal is to be able to rank ROI as defined by (marginal benefit)/(marginal cost).
Ask a CIO what’s more important – ROI or Dedupe ratios.
So we came up with some math to reflect that basic ROI concept and ended up with CORE. We vetted the idea with the Wikibon community and received some excellent feedback. You can read a full description of the math for CORE on Wikibon but basically it goes like this:
For a given capacity reduction technology, CORE is the capacity being reduced (S) times the percent reduction achieved (R) times the value of the capacity being saved (V) divided by the cost of the solution doing the reducing (C) times the elapsed time to compress the capacity (tc).
Here’s the formula:
CORE = (S × R × V) ÷ (C × tc)
The bottom line of CORE is the higher the number the more efficient and effective the technology…and the higher the ROI.
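To make the formula concrete, here is a minimal sketch in Python. The input values are placeholders I made up purely for illustration; they are not the figures from the Wikibon model, and the units (TB, dollars, and an unspecified time unit for tc) are my assumption.

```python
def core(s_tb: float, r: float, v_per_tb: float, cost: float, t_c: float) -> float:
    """CORE = (S x R x V) / (C x tc).

    s_tb     -- capacity being reduced (S), assumed here to be in TB
    r        -- fraction of capacity removed (R), e.g. 0.5 for 50%
    v_per_tb -- value of each unit of capacity saved (V), assumed $/TB
    cost     -- cost of the solution doing the reducing (C), assumed $
    t_c      -- elapsed time to compress the capacity (tc), any consistent unit
    """
    if cost <= 0 or t_c <= 0:
        raise ValueError("cost and t_c must be positive")
    return (s_tb * r * v_per_tb) / (cost * t_c)


# Hypothetical inputs, NOT taken from the model's tables.
example = core(s_tb=100, r=0.5, v_per_tb=10_000, cost=100_000, t_c=1.0)
print(example)  # 5.0
```

Because C and tc sit in the denominator, a cheaper or faster solution raises CORE just as much as a proportionally better reduction rate does.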
Warning…these are rough figures and haven’t been fully tested. But they are incredible nonetheless. If you want to change one of the assumptions in the model make a comment in this blog or email me and I’ll happily re-run the figures. But I honestly don’t think it will change the conclusions dramatically.
We started with a primary storage use case, which is a hot area of discussion these days, and looked at eight different data reduction technologies that vendors claim to be appropriate for primary storage (see Table 1). We assumed a 100TB target capacity to compress. In addition, the table below shows our assumptions around % capacity reduction (Carter’s dedupe ratio), the time to compress (because time is money), the cost of the solution, and the resulting CORE (which is a measure of business value).
[UPDATE: 4/28/2010 - Primary storage is defined as on-line active data with a specified 'typical' read:write ratio, where a user writes data to a persistent medium and can read back that data transparently. Transparency in this definition means when writing and reading data there is no disruption to a user's application experience at any time.]
Table 1: Effectiveness of Capacity Optimization Technologies for Primary Storage
*The higher the CORE the more effective the technology
*Time to compress values are rough estimates for the unit of capacity specified in the assumptions below. Time to compress has the single biggest impact on CORE
*NetApp doesn’t charge for ASIS – we took a percentage of the array’s cost
Here are the additional assumptions used in the model.
Table 2: Assumptions Behind the CORE Model
What Does this All Mean?
First it’s important to remember the use case here is narrowly defined as primary storage. I really hadn’t comprehended this whole space until we went through this analysis. Vendors brief you and they position their products as targeting primary storage and generally it makes sense. Given that caveat, here’s my take:
- There is only one technology on this map (Storwize) that I would consider appropriate for true primary storage—i.e. active data.
- The rest are really focused on stale data or secondary storage.
- I believe the reason is that if you tried applying these technologies to active data you would run into some serious performance problems, which is why most of them operate post-process. That is clearly demonstrated by this model.
There are several caveats to this analysis starting with time to compress. I used rough estimates in terms of time to compress a file. But even if I’m off base by a fairly wide percentage, it really doesn’t matter because in-line processing is much faster than alternatives. This is substantially what drives the CORE values along with the other factors cited.
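A quick way to test that claim is to perturb the time estimates and see whether the ranking flips. The sketch below assumes two hypothetical technologies that differ only in time to compress, with the in-line one 100 times faster; every number in it is a placeholder of mine, not a value from the model.

```python
def core(s: float, r: float, v: float, c: float, t_c: float) -> float:
    """CORE = (S x R x V) / (C x tc), as defined earlier in the post."""
    return (s * r * v) / (c * t_c)


# Hypothetical shared inputs so that only time to compress differs.
BASE = dict(s=100.0, r=0.5, v=10_000.0, c=100_000.0)
INLINE_T = 1.0    # hypothetical time units for the in-line technology
POST_T = 100.0    # hypothetical: post-process takes 100x longer


def ranking_holds(error: float) -> bool:
    """Check the worst case for a given fractional estimation error (0 <= error < 1).

    Worst case: the in-line time was underestimated (really slower) and the
    post-process time was overestimated (really faster).
    """
    worst_inline = core(t_c=INLINE_T * (1 + error), **BASE)
    best_post = core(t_c=POST_T * (1 - error), **BASE)
    return worst_inline > best_post


for err in (0.1, 0.5, 0.9):
    print(f"estimates off by {err:.0%}: in-line still ahead = {ranking_holds(err)}")
```

Under these assumptions the in-line technology stays ahead even when both estimates are off by 90%, because CORE scales with 1/tc and the starting gap is two orders of magnitude.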
Another caveat concerns NetApp’s ASIS and ZFS. Both are ‘free,’ meaning NetApp and Sun/Oracle don’t charge additional fees for the feature, which would make their ROI virtually infinite. But as we know nothing in technology is really free and so we had to make some assumptions about the actual cost of the feature when it’s being applied (in terms of resource consumption).
As well, most vendors on this list are careful about how they position their products. For example, Permabit is really focused on archiving use cases but the company’s technology can be applied to primary storage to increase the data reduction potential. Permabit has a “dedupe everywhere” mantra but its main focus is on archiving use cases; as is the case for most vendors on this list. The difference is Permabit is clear about that in its marketing.
It’s Not Only About the Reduction Rate
You can see this by looking at the following diagram. The chart shows capacity reduction rate on the horizontal axis and CORE on the vertical axis. While the technologies vary quite dramatically in terms of dedupe or compression ratio, from a business value standpoint it really doesn’t matter. Why do I say that? Because when you factor in reduction ratio, cost, speed and efficiency – the real measures of ROI – wide swings in capacity optimization ratios have little impact on value (CORE). That sounds bizarre, but if you care about disruption to your IT shop, it’s true.
Notice that most technologies cluster around a relatively low CORE (i.e. a CORE value of less than 250), although ZFS’s CORE is higher because compression is built directly into the I/O pipeline. But even ZFS is still too slow when considering the impact on the performance of applications. In the case of Storwize, the only real-time technology we evaluated, the CORE is off the charts.
The bottom line is despite the way some vendors position their products, most should not be applied to what I think of as primary storage. This doesn’t mean there’s anything wrong with the technology—it just means it’s not appropriate for true primary storage applications.
Figure 1: In-line Compression is Orders of Magnitude “Better” For Primary Use Cases
I would say based on this analysis that any solution with a CORE of less than 1000 should not be considered for real time primary storage. The only technologies I know of that are truly real time are Storwize and HIFN. There may be others and I need to do more research on this topic to develop a broader perspective. I’d be interested if readers of this post know any additional technologies that perform real time compression other than these two.
Additionally, as many in the Wikibon community have suggested, this analysis could go much deeper and include downstream effects on: a) other use cases like backup; b) operational costs (e.g. energy consumption) and c) performance and other capacity impacts. I agree. If you did that you would really get a better picture of the ROI, especially as you add in factors such as the ability to support heterogeneous storage.
The Winzip Factor
The performance issue is why I included Winzip. Everyone uses Winzip and while the inclusion of the technology in this analysis is hypothetical (i.e. you’d never use Winzip to compress 100TB) it is instructive to think about how Winzip works. You have a file to compress and you need space on the disk to compress and uncompress. If you don’t have the space you can’t use the tool. If you try to compress a file and you don’t have enough disk space you’ll get an error message.
In the case of Winzip the user doesn’t mind so much because you’re really compressing the files temporarily to store them and perhaps move them. Nonetheless, if the point is to save money by reducing capacity, one has to wonder: “If I need the disk space available to be able to decompress the files, how much money am I really saving on disk space?”
The obvious answer is I’ll never need to decompress all the compressed files that I have at the same time so as long as I have enough space available for the ones I want to bring back, I’m saving money. But users should be careful to understand the overheads associated with various capacity reduction techniques.
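The overhead can be put into rough numbers. The sketch below assumes you must keep enough free space to decompress whatever fraction of the data you expect to bring back at one time; the function name, the "working set" parameter, and the example values are all hypothetical.

```python
def net_savings_tb(original_tb: float, reduction: float, working_set: float) -> float:
    """Capacity actually freed once decompression scratch space is reserved.

    original_tb -- size of the data before compression, in TB
    reduction   -- fraction removed by compression (0.5 means 50% smaller)
    working_set -- fraction of the ORIGINAL data you may need decompressed
                   at the same time (hypothetical planning assumption)
    """
    compressed_tb = original_tb * (1 - reduction)
    scratch_tb = original_tb * working_set
    return original_tb - (compressed_tb + scratch_tb)


# Hypothetical example: 100 TB compressed by 50%, with 10% restored at once.
print(net_savings_tb(100, 0.5, 0.10))  # 40.0, not the headline 50.0

# If everything had to be decompressed at once, compression would cost space.
print(net_savings_tb(100, 0.5, 1.0))   # -50.0
```

The gap between the headline 50 TB and the net 40 TB is the kind of overhead worth asking about before comparing reduction rates.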
Capacity optimization for primary storage is becoming more mainstream, and there are many solutions to consider, including several we have not assessed here such as NTFS, HIFN and Oracle columnar compression. Don’t get sucked into debates about dedupe and compression ratios – as Carter points out in his blog – there’s lots of smoke and mirrors there. Each technology has its place; however, in-line compression is fundamentally more efficient and in theory can be applied to a wider variety of use cases. Technologies that operate post process do so because that’s the only way they can operate without disrupting IT operations.
Users should be mindful of this and be careful not to get confused by phrases like “data reduction for primary storage.” Primary in this case may really mean secondary.