Moderator: Peter Burris
Analyst: David Floyer
Users have been struggling for years with the challenge of reducing the amount of storage necessary to support critical applications in their organizations. One technology that has been put forward for quite some time recently received a significant boost from announcements by both Network Appliance and IBM.
Data deduplication promises potentially very high space savings (30%-50%) for storage environments that feature frequent cloning of individual pieces of data at the file, record, or block level. Data deduplication takes three basic forms: in-line, block hashing, and logical construct. Each approach has its pros and cons, but all of them seek to find circumstances in which the same piece of data has been replicated multiple times in response to often arbitrary backup and/or application activities.
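As a rough illustration of the block-hashing approach, the sketch below splits a byte stream into fixed-size blocks, stores each unique block once under its hash, and represents the original data as a list of pointers. This is a toy model, not any vendor's actual format; the function names and the 4 KB block size are illustrative assumptions.

```python
import hashlib

def dedupe(data: bytes, block_size: int = 4096):
    """Store each unique fixed-size block once, keyed by its SHA-256 hash;
    the original data becomes an ordered list of hash pointers."""
    store = {}      # hash -> block bytes, written only once per unique block
    pointers = []   # ordered hashes that reconstruct the original stream
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # skip blocks already stored
        pointers.append(digest)
    return store, pointers

def restore(store, pointers) -> bytes:
    # Follow the pointer list to rebuild the original byte stream.
    return b"".join(store[d] for d in pointers)

# Four logical blocks, three of them identical: only two blocks are stored.
data = b"A" * 4096 * 3 + b"B" * 4096
store, pointers = dedupe(data)
print(len(store), len(pointers))  # 2 unique blocks behind 4 pointers
```

The pointer list is also where the performance concern discussed below comes from: every restore is a walk through indirection rather than a sequential read.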
It is important to note that the applications receiving the largest benefit from deduplication tend to be those with very high backup and restore requirements, such as database backup and software archiving, where the notion of truth in the data becomes very important and, as a consequence, that data is repeatedly cloned across different application forms (e.g., to data warehouses).
Users evaluating data deduplication today face only a handful of concerns, but important ones nonetheless. The most significant is that data deduplication is implemented using proprietary formats. Metadata describing how the data has been deduped is written directly into file headers, along with pointers that assure applications access to the copy of the data they need. The system of pointers that results from these technologies can lead to some performance degradation; indeed, the storage environments that benefit most from data deduplication are likely also to be those that face the greatest performance concerns. Additionally, it is critical that encryption occur after data deduplication to ensure that overall integrity and other basic storage concerns can be maintained.
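A toy sketch helps show why that ordering matters. Identical plaintext blocks hash to a single copy, but properly randomized encryption makes every copy's ciphertext unique, leaving deduplication nothing to find. The `toy_encrypt` cipher below is an illustrative stand-in for that randomization, not a real or secure algorithm.

```python
import hashlib
import secrets

def toy_encrypt(block: bytes, key: bytes) -> bytes:
    """Toy randomized encryption: fresh IV plus a hash-derived keystream.
    NOT secure; it only mimics how real ciphers randomize their output."""
    iv = secrets.token_bytes(16)
    keystream = hashlib.sha256(key + iv).digest()
    stream = (keystream * (len(block) // len(keystream) + 1))[:len(block)]
    return iv + bytes(b ^ s for b, s in zip(block, stream))

key = b"illustrative-key"
blocks = [b"the same 4 KB block" for _ in range(100)]  # 100 identical copies

# Dedup before encryption: all 100 copies collapse to one unique hash.
plain_hashes = {hashlib.sha256(b).digest() for b in blocks}

# Encrypt first: random IVs make each ciphertext distinct, so dedup finds nothing.
cipher_hashes = {hashlib.sha256(toy_encrypt(b, key)).digest() for b in blocks}

print(len(plain_hashes), len(cipher_hashes))  # 1 unique block vs. 100
```

Deduplicate first and encrypt the stored unique blocks afterward, and both the space savings and the security goal are preserved.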
We will see a fair amount of discussion about how data deduplication can serve as a general-purpose replacement for tape in backup and restore scenarios. However, for a variety of reasons, not the least of which remains the cost of communicating large volumes of data over potentially great distances, tape will continue to have a viable life for the foreseeable future despite some of the advantages of data deduplication. At this juncture it is safe to say that the best current data deduplication implementation offers no major advantage over the worst tape solution for very high-volume data backup and recovery applications.
As we look forward, we see deduplication becoming a critical enabling technology that pairs successfully with other emerging storage technologies, including thin provisioning and virtualization. However, it is imperative that users look very closely at the tradeoff between the advantages of deduplication and its potential performance costs on the one hand, and on the other fully understand the consequences of buying into yet another storage technology built on relatively proprietary formats.
Action Item: Data deduplication is emerging as a critically important new arrow in the storage administrator's quiver for answering hard questions about the growing problem of storage costs. However, like all technology arrows, users must be careful in choosing which targets to shoot data deduplication at, and very sure of their aim.