Among the most popular features of modern storage appliances is data deduplication. Most storage buyers have only an abstract understanding of the benefits this feature brings, so this article describes ten facts about data deduplication. Some may strike you as obvious and some will not, but you will learn that there is more to data deduplication than meets the eye.
What is data deduplication?
Any understanding of the facts about data deduplication requires a common definition of the term. Data deduplication is a data reduction technology that eliminates redundant data. When data is deduplicated, a single instance of duplicate information is retained while the duplicate instances are replaced with pointers to this single copy. A comprehensive index is still maintained so that all data can be transparently accessed.
Here, the focus is on disk-based deduplication, but understanding that deduplication can happen elsewhere is important.
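The definition above can be illustrated with a minimal sketch. The `DedupStore` class and its names are hypothetical, invented for illustration; real systems hash data in chunks and persist the index, but the core idea is the same: keep one physical copy and let everything else be a pointer.

```python
import hashlib

class DedupStore:
    """Toy illustration: store each unique piece of data once, keyed by its hash."""

    def __init__(self):
        self.chunks = {}   # hash -> actual bytes (the single retained instance)
        self.index = {}    # logical name -> hash (the "pointer")

    def write(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.chunks:   # new data: store the single real copy
            self.chunks[digest] = data
        self.index[name] = digest       # duplicates cost only a pointer entry

    def read(self, name):
        # Transparent access: the caller never sees the deduplication.
        return self.chunks[self.index[name]]

store = DedupStore()
store.write("a.txt", b"same payload")
store.write("b.txt", b"same payload")   # duplicate: not stored a second time
assert store.read("b.txt") == b"same payload"
assert len(store.chunks) == 1           # only one physical copy exists
```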
Data deduplication brings with it direct business benefits
Although technology people love data deduplication on its technical merits, no technology should be deployed that doesn't provide direct benefit to the business. Fortunately, data deduplication does, in fact, boast serious business credibility in a number of different ways.
Most obviously, deduplication creates direct cost savings. It reduces the amount of raw storage space necessary by eliminating redundant data elements, leaving only a single real copy consuming the storage space. Perhaps the most familiar example of a form of data deduplication is the single-instance storage feature that used to be a part of Microsoft Exchange, which stores one copy of an e-mail sent to multiple recipients; all other "copies" are pointers in the recipients' mailboxes. So, an original 10 MB message sent to 100 recipients takes up 10 MB of storage space rather than 1,000 MB. This creates obvious, measurable cost savings: lower disk utilization generally translates into lower costs.
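The Exchange example above reduces to simple arithmetic, shown here as a quick sketch (the pointer overhead per recipient is small and is ignored):

```python
message_mb = 10
recipients = 100

without_dedup = message_mb * recipients   # every mailbox holds a full copy
with_dedup = message_mb                   # one real copy; the rest are pointers

savings = 1 - with_dedup / without_dedup
print(without_dedup, with_dedup, f"{savings:.0%}")   # prints: 1000 10 99%
```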
Compression alone is not data deduplication
Compression and data deduplication are not the same; they are, in fact, complementary data reduction technologies and both can serve as ways to reduce the cost of ownership for storage infrastructures.
Whereas data deduplication operates on larger chunks of data, compression works on byte patterns that are individually only a few bytes long. With this difference in granularity, it’s apparent how the two technologies can complement one another.
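A small sketch makes the complementary granularities concrete: deduplication first removes whole duplicate chunks, then compression shrinks the byte-level redundancy inside the unique chunks that remain. The chunk size and sample data are arbitrary choices for illustration.

```python
import hashlib
import zlib

# Three 4,000-byte chunks; the first and third are identical.
data = b"ABCD" * 1000 + b"EFGH" * 1000 + b"ABCD" * 1000
CHUNK = 4000

chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

# Deduplication: coarse, chunk-level -- drops the duplicate chunk entirely.
unique = {hashlib.sha256(c).hexdigest(): c for c in chunks}

# Compression: fine, byte-level -- shrinks the repetitive bytes inside each chunk.
stored = sum(len(zlib.compress(c)) for c in unique.values())

assert len(unique) < len(chunks)                       # dedupe removed a chunk
assert stored < sum(len(c) for c in unique.values())   # compression shrank the rest
```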
Data deduplication shortens the backup window
As organizations gather more and more data and expand their data centers, backup times grow to unreasonable lengths. Any technology that can reduce the actual amount of data that has to traverse the network on its way to a backup system, or that otherwise speeds the process, can shorten this critical data protection activity.
By keeping only a single copy of redundant data elements, data deduplication provides a significant positive impact with regard to the length of the data backup window.
Different kinds of data deduplication
From the storage side of the house, two primary kinds of deduplication are discussed here: file and block dedupe. From a meta perspective, block-level deduplication operates at the volume level by deduplicating the blocks of data that comprise the volume. File-level deduplication works, as its name implies, at the file level. If duplicate files are in the deduplication domain, they are single-instanced.
File deduplication is generally considered a coarse level of deduplication and block-level, fine-grained. As such, block-level deduplication can often yield more substantial results than file-level dedupe.
File deduplication works only on whole, identical files. Block dedupe can work even on files that are merely similar. For example, in a case where multiple edits of a document or changes to a spreadsheet are maintained, a file may exist in several versions, each with a few words changed. File dedupe wouldn’t help with these files, but block dedupe may be able to deduplicate them at the sub-file, or block, level.
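The file-versus-block distinction can be sketched with hashes. In this hypothetical example, two versions of a file differ by one same-length word: file-level hashing sees two different files and saves nothing, while fixed-size block hashing finds that all but one block can be shared. (Real block dedupe typically uses much larger blocks and often variable-size chunking.)

```python
import hashlib

def file_hash(data):
    """File-level identity: one hash over the whole file."""
    return hashlib.sha256(data).hexdigest()

def block_hashes(data, size=8):
    """Block-level identity: one hash per fixed-size block."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

v1 = b"quarterly report draft, " * 8
v2 = v1.replace(b"draft", b"final", 1)   # one same-length word changed

# File-level: the files differ, so file dedupe saves nothing.
assert file_hash(v1) != file_hash(v2)

# Block-level: all blocks except the changed one are identical and shareable.
shared = sum(a == b for a, b in zip(block_hashes(v1), block_hashes(v2)))
assert shared == len(block_hashes(v1)) - 1
```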
There are multiple data deduplication modes
Data deduplication can happen in a couple of different ways.
- Inline: Sometimes called in-band deduplication, it performs the deduplication operation while data is being written to the storage medium. This method increases the continuous processing load on the storage since deduplication is always happening. Further, this method could result in a bit of additional write latency since the data needs to be processed prior to being written.
- Post-process: Also known as out-of-band deduplication, this happens after the data has already been written to disk. This means that the storage needs to support the original data size while it awaits the next deduplication window.
In either case, deduplication of production databases will cause some delay in read operations as the data being retrieved is "re-hydrated" before it is loaded into memory. For this reason, many organizations use dedupe mainly for backup copies.
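The two modes above can be contrasted in a minimal sketch. The function names are invented for illustration: inline dedupe does the hash lookup in the write path (the source of its small write-latency cost, but no duplicate ever lands on disk), while post-process dedupe writes raw data first and collapses duplicates in a later scheduled pass (so the disk must hold every copy until that pass runs).

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def inline_write(store, index, data):
    """Inline mode: dedupe in the write path, before data hits disk."""
    digest = h(data)             # hashing here adds a little write latency
    if digest not in store:
        store[digest] = data     # only unique data is ever written
    index.append(digest)

def post_process(raw_writes):
    """Post-process mode: raw copies sat on disk until this pass ran."""
    store, index = {}, []
    for data in raw_writes:
        digest = h(data)
        store.setdefault(digest, data)   # collapse duplicates after the fact
        index.append(digest)
    return store, index

store, index = {}, []
for payload in (b"x" * 512, b"y" * 512, b"x" * 512):
    inline_write(store, index, payload)
assert len(store) == 2 and len(index) == 3   # duplicate never stored
```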
Data deduplication can happen in one of two different places
The deduplication process can happen in one of two places—at either the target or the source. This part of the article deals primarily with deduplication during backup, where the location of the deduplication step matters most.
- Target: Deduplication happens at the backup target. This works well when you want to use backup software that doesn’t dedupe, but your target storage does. However, this approach will do little to shorten backup times, since the entire data set is being transmitted through the network to the target.
- Source: When you’re backing up a system using source-based deduplication, the backup agent deduplicates data before sending it to the backup target. This reduces the amount of data that is sent over the network, and therefore transmission time, and the amount of data that has to be stored on the backup target.
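The source-based approach can be sketched as a simple protocol: the backup agent first checks which chunk hashes the target already holds, then transmits only the missing chunks. All names here are hypothetical, and the "round trip" is simulated with a shared dictionary rather than a network.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

server_store = {}   # chunks already held by the backup target

def source_dedup_backup(chunks):
    """Source-side dedupe: ship only chunks the target doesn't have."""
    missing = {h(c) for c in chunks if h(c) not in server_store}
    sent_bytes = 0
    for c in chunks:
        digest = h(c)
        if digest in missing and digest not in server_store:
            server_store[digest] = c   # only new data crosses the "network"
            sent_bytes += len(c)
    return sent_bytes

first = source_dedup_backup([b"a" * 100, b"b" * 100])
second = source_dedup_backup([b"a" * 100, b"b" * 100, b"c" * 100])
assert first == 200    # initial backup ships everything
assert second == 100   # next backup ships only the one new chunk
```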
Deduplication will be a feature of Windows Server 8
Data deduplication will play a pivotal role in Windows Server 8 when an administrator enables the new data deduplication role service. Once this role is enabled and configured, Windows creates scheduled tasks that run to perform the actual deduplication.
However, there are some things that should be understood about this initial deduplication effort from Microsoft:
- Windows Server 8’s dedupe is block-based.
- Deduplication through this method is not supported on the Windows boot volume or system volumes. However, it does work on storage volumes, making it particularly suitable for file servers.
- Only NTFS volumes can be deduped. FAT, FAT32 and ReFS volumes are unsupported.
- Microsoft’s deduplication does not work on compressed or encrypted volumes.
- Windows Server 8 uses a post-process deduplication method.
Deduplication plays a pivotal role in the future of storage
I wrote about this last week, but it bears repeating. A new class of storage is appearing that places data deduplication, compression and other data reduction technologies at the foundation of the solution. These solutions are being developed because solid state-based storage devices are increasingly relied upon to meet rising performance needs.
The primary problem with solid state storage is that it can’t meet growing capacity needs without new ways to extend it. That’s where a cluster of data reduction tools comes into play.
As you investigate the current storage market, look for these kinds of trends. They will necessarily change some of the metrics traditionally used by storage buyers. Now, rather than looking at just a dollar-per-TB or dollar-per-IOPS metric, buyers need to determine what kind of data reduction ratio a particular product can deliver.
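The shift in metrics amounts to one division, sketched below. The prices and ratio are made-up figures for illustration; the point is that effective cost per TB is raw cost divided by the reduction ratio the product achieves on your data.

```python
raw_cost_per_tb = 500.0    # assumed raw price in dollars per TB (illustrative)
reduction_ratio = 4.0      # e.g. 4:1 from dedupe plus compression (illustrative)

# The metric buyers should compare: effective cost per usable TB.
effective_cost_per_tb = raw_cost_per_tb / reduction_ratio
print(effective_cost_per_tb)   # prints: 125.0
```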
Always run pilots before implementing dedupe
For an organization that is considering implementing broad data deduplication technology, there is a lot to learn. Different vendors use very different deduplication methods. These different methods can impact overall storage performance, storage utilization and backup windows, among other things. Prior to deploying a dedupe solution, perform pilots with each and every vendor and put each of them through their paces. Ensure that the solution can meet your organization’s storage performance needs while still meeting the expectations that you have for the deduplication process.
Action items: Here are a few immediate steps for your consideration:
- Always run pilots. Test everything measurable to determine solution suitability.
- Consider the changing storage landscape as it relates to deduplication.
- Test Windows Server 8’s deduplication technology.
- Understand the full data lifecycle including where data deduplication can have the most benefit.
- Ensure that the deduplication technology meets business needs.