Storage Peer Incite: Notes from Wikibon’s June 4, 2007 Research Meeting
This week Wikibon and Storage Markets present "Future perfect scenario: Standards keep tape from getting duped." Storage Markets applies the stock market model to predicting future events in the storage market by giving registered users the ability to "place bets" by buying and selling "shares" of predictions about the market, one of which is that a data deduplication standard will emerge within the next year. This methodology has proven accurate in political scenarios and is used to predict elections. In this week's Peer Incite Meeting, Wikibon took the prediction that the dedupe standard would emerge and focused its community on putting meat on the bare bones of that prediction, presuming that it comes to pass by June 30, 2008.
Future perfect scenario: Standards keep tape from getting duped
Today is June 30, 2008 in the forecasting game we played at the Tuesday, June 5, 2007 Wikibon Peer Incite Meeting. We "looked back" at what has occurred in the past year (June 2007-June 2008) to foster the emergence of deduplication as a customer purchase requirement in data protection products, based on a prediction being tested by Storage Markets. We convened the Wikibon community to talk about the key factors that must have occurred for this prediction to come true.
First, it seems clear that tape vendors, pushed by eroding price points driven by competition from disk technology, will have created a data deduplication standard and announced aggressive schedules for developing an industry-standard implementation and incorporating it into their tape controllers. The driving force in this will have been IBM, the force behind most tape standards of the past three decades, which is seeking to unify its tape and disk businesses by implementing the new standard in both sets of products. This also has the effect of moving more processing from servers into the tape and disk controllers.
At present, users by and large continue to recognize tape as the most cost-effective choice for meeting the recovery point objectives of their disaster recovery plans. Despite offering lower effective compression ratios than some competing technologies, tape still has the advantage that it can be moved from one place to another over great distances at low cost, with high certainty of safe arrival, compared with transporting the same volume of data via network. However, disk economics are growing more attractive, prompting the major tape system providers to announce joint support for the deduplication standard to reestablish leadership in the backup/restore marketplace and stop price erosion.
Disk vendors, on the other hand, will have had a lukewarm response to the new standard but will nonetheless show a willingness to respond if real market momentum develops behind it. We expect, however, that it will take six months to a year for them to move toward compliance with the standard.
One reason users will have been forcing this issue is the realization that, at this point in technological development, data deduplication provides an easier and more natural path to significant cost savings than thin provisioning. Consequently, in the “last” year we will have seen much more rapid development and deployment of deduplication products than of thin provisioning. Interest in thin provisioning remains high, however, and we still expect more action in this area in the future.
Finally, we will have seen the first hints of business and government action to push standards as a way to ensure that they can remain in control of their data and not become locked into proprietary traps in the face of increasing concern about the realities of managing information assets and liabilities. Specifically, the European Union will have started the first hearings toward potential regulatory requirements for data quality, data control and data distribution, forcing large institutions to drive their compliance efforts down closer to the technologies that service the applications that are central to business activity.
Action Item: Users will have adopted data deduplication as an important first step in transforming how they administer storage in a market that is experiencing significant change. Users should continue to press their suppliers for greater support of standards that will improve their ability to manage their information liabilities and assets. However, they should not regard deduplication as a cure-all for out-of-control data growth. Dedupe is most effective in data backup and recovery applications, where large volumes of unchanged data are stored over and over. In other applications, such as ERP and CRM, dedupe will not offer major advantages; there users will be best advised to wait for the development of thin provisioning technology.
Who owns de-dupe?
IT executives can expect a queue of people wanting to be responsible for saving 50% or more of storage budgets by implementing deduplication. However, one person's savings could easily cost other people plenty. There are clear trade-offs among lower disk costs, application performance impact, longer RPOs, and higher bandwidth costs. It will be easy for poorly aligned budget systems to drive inappropriate decisions.
Action Item: IT executives should not roll out deduplication as an IT infrastructure standard any time soon, especially if no deduplication standard is in place. CTO skunk works or controlled experiments on the total systems impact of deduplication on a few specific applications are appropriate initially to build up practical experience and pragmatic guidelines. IT executives should initially counter excessive vendor hype and dampen expectations for deduplication both within IT and outside it.
Data de-duplication and the low-end backup/restore choice
Tape is the most ancient of storage life forms, yet it, too, is poised to have to learn a new trick: data de-duplication. Very likely, this will be the technology that demarcates a new generation of tape products, as tape vendors incorporate data de-duplication into a new class of tape controllers. Large enterprises most likely will be the first to pursue backup/restore strategies based on modern, evolving tape products. However, smaller shops that have been satisfying less-complex backup/restore requirements with older tape technologies should weigh a move to higher-end tape environments carefully against real options to use disk-based backup/restore technologies or third-party backup/restore services. Ultimately, a move to higher-end tape technologies will be juxtaposed against the costs of devoting incremental network bandwidth to backup/restore and establishing "active" offsite storage space (i.e., space in which to run disk, as opposed to simply shelving tape). In evaluating this option, users should weigh carefully its full implications. There are clear advantages in automation and data recovery for high-value data, but equally clear disadvantages in longer RPOs (more data lost in the case of a disaster) and higher costs.
If data deduplication standards emerge, it will be good news for either option: bandwidth requirements will be lower, and tape backups will complete more quickly.
Action Item: Tape vendors are likely to introduce a wave of new invention around technologies like data de-duplication in response to both real high-end customer requirements and revenue pressures. Rather than investing in a new round of tape technology, smaller IT organizations facing less complex backup/restore requirements should consider the pros and cons of alternative disk-based and third-party approaches to handling backup/restore.
Integrating deduplication into IT infrastructure
True integration of deduplication into the IT infrastructure would allow data to be held in deduplicated format except when it is being processed. The data could be created, copied locally, copied remotely, migrated, backed up and recovered in deduplicated form. Applications could know about and control whether deduplication is invoked, and set any relevant parameters.
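As a rough illustration of that vision, the sketch below shows what an application-facing deduplication service might look like. It is a minimal sketch only; the names (DedupPolicy, StorageService, put, get) are hypothetical and simply mark the points where an application could decide whether deduplication is invoked and with which parameters.

```python
# Hypothetical sketch of deduplication exposed as a storage service.
# All names and parameters are illustrative, not any vendor's real API.
from dataclasses import dataclass


@dataclass
class DedupPolicy:
    enabled: bool = True          # the application decides whether dedup is invoked
    chunk_size: int = 64 * 1024   # tunable parameter exposed to the application
    fingerprint: str = "sha-256"  # detection algorithm choice is left to the vendor


class StorageService:
    """Data is held deduplicated at rest and rehydrated only when processed."""

    def put(self, volume: str, object_id: str, data: bytes, policy: DedupPolicy) -> None:
        raise NotImplementedError  # store unique chunks plus a recipe, per the policy

    def get(self, volume: str, object_id: str) -> bytes:
        raise NotImplementedError  # reassemble (rehydrate) the object on demand
```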
This vision of deduplication as a storage service raises three key questions for IT executives:
- Is it technically possible to create a deduplication standard?
- It is technically possible to create a standard and retain plenty of room for innovation. Deduplication has two parts. The hard part, with plenty of room for innovation, is creating the algorithms that detect common strings. The easier part is creating a schema that connects the places where duplicates have been found with the duplicate strings, lays out how the data is held within the same data volume, and allows protection of and controlled access to the data. This schema, not the algorithms, would be the standard (a minimal sketch of such a schema appears after this list).
- Can the market create and sustain such a standard?
- Dave Vellante nicely addresses a potential win-win-win scenario among tape vendors, disk vendors and users in his piece “Data deduplication standards and the domino effect”. The market would effectively stall any attempt by a single company to establish a de facto standard; a de jure standard is the only way.
- What is the degree of adoption that should be tried with and without a standard?
- Adoption of deduplication becomes an order of magnitude easier if a deduplication standard is in place. Without a standard it is necessary to deduplicate the data every time it is moved, and software vendors would have little incentive to support deduplication.
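The sketch below illustrates the schema idea referenced above, under the assumption that a standard would cover only the on-media layout: how detected duplicates are recorded and connected to the unique data they reference, not how duplicates are detected. The type names are hypothetical.

```python
# Minimal sketch of a deduplication schema that could be standardized while
# leaving the detection algorithms proprietary. All names are illustrative.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChunkRecord:
    fingerprint: str          # produced by any vendor's detection algorithm
    offset: int               # where the unique bytes live within the volume
    length: int
    refcount: int = 1         # protects shared data from premature deletion


@dataclass
class ObjectRecipe:
    object_id: str
    fingerprints: List[str]   # ordered references that reconstruct the object


@dataclass
class DedupVolume:
    chunk_index: Dict[str, ChunkRecord] = field(default_factory=dict)
    recipes: Dict[str, ObjectRecipe] = field(default_factory=dict)
```

In this framing, any controller, disk or tape, that understood the layout could restore the data, while vendors would continue to compete on how efficiently duplicates are found.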
Action Item: IT executives should wait until deduplication standards that connect disk, tape and applications are in place and supported by major vendors before embracing the full integration of deduplication into the IT infrastructure. In the meantime, restrict implementations to point solutions for specific applications that have clear and high returns, and document clear exit strategies. Oh yes, invite Bill and Scott to lunch together.
Data deduplication standards and the domino effect
Sometimes it feels like tape vendors are asleep at the wheel. While disk vendors aggressively market data deduplication as a tape replacement, tape suppliers appear indifferent and sanguine as they eye new tape applications in fixed content, archiving, compliance and other tier 3 applications, as though these will somehow insulate them from disk-based competition. Freeman Associates reports that tape users purchased 50% more capacity in 2006 than in 2005, yet revenue still fell, underscoring the imperative for tape vendors to develop data deduplication technologies and push for standards. Indeed, IBM may be the sole hope of advancing such deduplication standards and further opening the opportunities for dedupe adoption.
The domino theory goes something like this. IBM, understanding the user benefits of seamlessly integrating disk and tape deduplication and recognizing it has more to gain than to lose by promoting a standard, uses its deep expertise in tape and its leverage as a leader in storage to spearhead the implementation of deduplication in tape and promote an industry-wide standard. Other tape vendors join in, recognizing an opportunity to protect both existing backup franchises and newer, emerging markets by preserving a half-century-long economic advantage over disk. Backup software suppliers happily support the standard, both to hedge bets and to widen market opportunities. Disk vendors are left in the uncomfortable position of having to buck the standard or support an initiative led by a competitor.
Action Item: Tape has always been a story of fragmented formats in which the rising tide of standardization raised all ships. Tape is a game of survival where no one wins unless everyone works together. Sun's Scott McNealy, in an effort to protect his $4.1B investment in STK, should get on a plane and visit IBM's Bill Zeitler to create a data deduplication standard spanning tape and disk.
Tape deduplication standard is a win for users
It would be nice to apply data deduplication to any application, but unfortunately it's not that simple. Performance considerations are important with this emerging technology. Writes are fairly clean, with the overhead of applying deduplication algorithms offset by the need to store less data, thereby reducing physical movement. Reads require more effort: the system must read the metadata (hopefully in cache), find and interpret the hash references, and then read the native data. As such, data deduplication should be aimed primarily at write-intensive applications.
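A minimal sketch of that asymmetry, assuming a simple in-memory chunk store keyed by SHA-256 fingerprints (real products differ substantially), is shown below: duplicate chunks make writes cheaper, while every read pays for index lookups and reassembly.

```python
# Minimal sketch of dedup write and read paths; illustrative only.
import hashlib
from typing import Dict, List

CHUNK = 64 * 1024
store: Dict[str, bytes] = {}          # fingerprint -> unique chunk


def dedup_write(data: bytes) -> List[str]:
    """Write path: hash each chunk and store it only if it is new."""
    recipe = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)   # a duplicate chunk costs only a lookup, not a write
        recipe.append(fp)
    return recipe


def dedup_read(recipe: List[str]) -> bytes:
    """Read path: every chunk needs an index lookup plus a (possibly scattered) read."""
    return b"".join(store[fp] for fp in recipe)
```

Repeated backups of largely unchanged data mostly hit the setdefault lookup, which is why backup streams benefit so strongly; random-read workloads pay the reassembly cost on every access.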
In addition, data deduplication will perform best on serial data streams with no locking involved (e.g., database applications are unlikely to be good candidates). Users should also target applications where many copies are made over time and similar copies of data are being moved. Candidate applications include backup, archiving and even certain specific data-mining operations. For some larger applications, certain files (e.g., history files) will be strong candidates.
Once these are determined, and despite vendor implications to the contrary, users should still plan on backing up at least some of the deduped data to tape. Today, the lack of deduplication technology in tape and the absence of standards mean that data must be rehydrated (un-deduped) before it can be backed up to tape. This adds overhead, complexity and elapsed time to backup and restore operations. A common deduplication standard across disk and tape would allow faster backups and allow deduped data to be restored to any disk that supports the standard, further simplifying operations and reducing reliance on proprietary vendor implementations.
Action Item: Users should push hard for both tape and disk vendors to develop data deduplication standards to facilitate simpler backup and restore operations. Organizations should be very careful about broadly committing to a single-vendor solution without understanding the holistic implications for disk-tape-disk cloning, backup and restore processes.