Among the most popular features of modern storage appliances is data deduplication. Most storage buyers have only an abstract understanding of the benefits this feature brings, so this article describes ten facts about data deduplication. Some may strike you as obvious and some will not, but you will learn that there is more to data deduplication than meets the eye.
What is data deduplication?
Any understanding of the facts about data deduplication requires a common definition of the term. Data deduplication is a data reduction technology that eliminates redundant data. When data is deduplicated, a single instance of duplicate information is retained while the duplicate instances are replaced with pointers to this single copy. A comprehensive index is still maintained so that all data can be transparently accessed.
Here, the focus is on disk-based deduplication, but understanding that deduplication can happen elsewhere is important.
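The definition above can be illustrated with a minimal sketch. The `DedupStore` class and its names are hypothetical, invented for illustration; real systems hash data in chunks and persist the index, but the core idea is the same: keep one physical copy and let everything else be a pointer.

```python
import hashlib

class DedupStore:
    """Toy illustration: store each unique piece of data once, keyed by its hash."""

    def __init__(self):
        self.chunks = {}   # hash -> actual bytes (the single retained instance)
        self.index = {}    # logical name -> hash (the "pointer")

    def write(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.chunks:   # new data: store the single real copy
            self.chunks[digest] = data
        self.index[name] = digest       # duplicates cost only a pointer entry

    def read(self, name):
        # Transparent access: the caller never sees the deduplication.
        return self.chunks[self.index[name]]

store = DedupStore()
store.write("a.txt", b"same payload")
store.write("b.txt", b"same payload")   # duplicate: not stored a second time
assert store.read("b.txt") == b"same payload"
assert len(store.chunks) == 1           # only one physical copy exists
```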
Data deduplication brings with it direct business benefits
Although technology people love data deduplication on its technical merits, no technology should be deployed that doesn't provide direct benefit to the business. Fortunately, data deduplication does, in fact, boast serious business credibility in a number of different ways.
Most obviously, deduplication creates direct cost savings. It reduces the amount of raw storage space necessary by eliminating redundant data elements, leaving only a single real copy consuming the storage space. Perhaps the most familiar example of a form of data deduplication is the single-instance storage feature that used to be a part of Microsoft Exchange, which stores one copy of an e-mail sent to multiple recipients; all other "copies" are pointers in the recipients' mailboxes. So, an original 10 MB message sent to 100 recipients takes up 10 MB of storage space rather than 1,000 MB. This creates obvious, measurable cost savings: lower disk utilization generally translates into lower costs.
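The Exchange example above reduces to simple arithmetic, shown here as a quick sketch (the pointer overhead per recipient is small and is ignored):

```python
message_mb = 10
recipients = 100

without_dedup = message_mb * recipients   # every mailbox holds a full copy
with_dedup = message_mb                   # one real copy; the rest are pointers

savings = 1 - with_dedup / without_dedup
print(without_dedup, with_dedup, f"{savings:.0%}")   # prints: 1000 10 99%
```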
Compression alone is not data deduplication
Compression and data deduplication are not the same; they are, in fact, complementary data reduction technologies and both can serve as ways to reduce the cost of ownership for storage infrastructures.
Whereas data deduplication operates on larger chunks of data, compression works on byte patterns that are individually only a few bytes long. With this difference in granularity, it’s apparent how the two technologies can complement one another.
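A small sketch makes the complementary granularities concrete: deduplication first removes whole duplicate chunks, then compression shrinks the byte-level redundancy inside the unique chunks that remain. The chunk size and sample data are arbitrary choices for illustration.

```python
import hashlib
import zlib

# Three 4,000-byte chunks; the first and third are identical.
data = b"ABCD" * 1000 + b"EFGH" * 1000 + b"ABCD" * 1000
CHUNK = 4000

chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

# Deduplication: coarse, chunk-level -- drops the duplicate chunk entirely.
unique = {hashlib.sha256(c).hexdigest(): c for c in chunks}

# Compression: fine, byte-level -- shrinks the repetitive bytes inside each chunk.
stored = sum(len(zlib.compress(c)) for c in unique.values())

assert len(unique) < len(chunks)                       # dedupe removed a chunk
assert stored < sum(len(c) for c in unique.values())   # compression shrank the rest
```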
Data deduplication shortens the backup window
As organizations gather more and more data and expand their data centers, backup times grow to unreasonable lengths. Any technology that can reduce the actual amount of data that has to traverse the network on its way to a backup system, or that otherwise speeds the process, can shorten this critical data protection activity.
By keeping only a single copy of redundant data elements, data deduplication provides a significant positive impact with regard to the length of the data backup window.
Different kinds of data deduplication
From the storage side of the house, two primary kinds of deduplication are discussed here: file and block dedupe. From a meta perspective, block-level deduplication operates at the volume level by deduplicating the blocks of data that comprise the volume. File-level deduplication works, as its name implies, at the file level. If duplicate files are in the deduplication domain, they are single-instanced.
File deduplication is generally considered a coarse level of deduplication and block-level, fine-grained. As such, block-level deduplication can often yield more substantial results than file-level dedupe.
File deduplication works only on whole, identical files. Block dedupe can work even on files that are merely similar. For example, in a case where multiple edits of a document or changes to a spreadsheet are maintained, a file may exist in several versions, each with a few words changed. File dedupe wouldn’t help with these files, but block dedupe may be able to deduplicate them at the sub-file, or block, level.
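The file-versus-block distinction can be sketched with hashes. In this hypothetical example, two versions of a file differ by one same-length word: file-level hashing sees two different files and saves nothing, while fixed-size block hashing finds that all but one block can be shared. (Real block dedupe typically uses much larger blocks and often variable-size chunking.)

```python
import hashlib

def file_hash(data):
    """File-level identity: one hash over the whole file."""
    return hashlib.sha256(data).hexdigest()

def block_hashes(data, size=8):
    """Block-level identity: one hash per fixed-size block."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

v1 = b"quarterly report draft, " * 8
v2 = v1.replace(b"draft", b"final", 1)   # one same-length word changed

# File-level: the files differ, so file dedupe saves nothing.
assert file_hash(v1) != file_hash(v2)

# Block-level: all blocks except the changed one are identical and shareable.
shared = sum(a == b for a, b in zip(block_hashes(v1), block_hashes(v2)))
assert shared == len(block_hashes(v1)) - 1
```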
There are multiple data deduplication modes
Data deduplication can happen in a couple of different ways.
- Inline: Sometimes called in-band deduplication, it performs the deduplication operation while data is being written to the storage medium. This method increases the continuous processing load on the storage since deduplication is always happening. Further, this method could result in a bit of additional write latency since the data needs to be processed prior to being written.
- Post-process: Also known as out-of-band deduplication, this happens after the data has already been written to disk. This means that the storage needs to support the original data size while it awaits the next deduplication window.
In either case, deduplication of production databases will cause some delay in read operations as the data being retrieved is "re-hydrated" before it is loaded into memory. For this reason, many organizations use dedupe mainly for backup copies.
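The two modes above can be contrasted in a minimal sketch. The function names are invented for illustration: inline dedupe does the hash lookup in the write path (the source of its small write-latency cost, but no duplicate ever lands on disk), while post-process dedupe writes raw data first and collapses duplicates in a later scheduled pass (so the disk must hold every copy until that pass runs).

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def inline_write(store, index, data):
    """Inline mode: dedupe in the write path, before data hits disk."""
    digest = h(data)             # hashing here adds a little write latency
    if digest not in store:
        store[digest] = data     # only unique data is ever written
    index.append(digest)

def post_process(raw_writes):
    """Post-process mode: raw copies sat on disk until this pass ran."""
    store, index = {}, []
    for data in raw_writes:
        digest = h(data)
        store.setdefault(digest, data)   # collapse duplicates after the fact
        index.append(digest)
    return store, index

store, index = {}, []
for payload in (b"x" * 512, b"y" * 512, b"x" * 512):
    inline_write(store, index, payload)
assert len(store) == 2 and len(index) == 3   # duplicate never stored
```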
Data deduplication can happen in one of two different places
The deduplication process can happen in one of two places—at either the target or the source. This part of the article deals primarily with deduplication during backup, where the location of the deduplication step matters most.
- Target: Deduplication happens at the backup target. This works well when you want to use backup software that doesn’t dedupe, but your target storage does. However, this approach will do little to shorten backup times, since the entire data set is being transmitted through the network to the target.
- Source: When you’re backing up a system using source-based deduplication, the backup agent deduplicates data before sending it to the backup target. This reduces the amount of data that is sent over the network, and therefore transmission time, and the amount of data that has to be stored on the backup target.
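The source-based approach can be sketched as a simple protocol: the backup agent first checks which chunk hashes the target already holds, then transmits only the missing chunks. All names here are hypothetical, and the "round trip" is simulated with a shared dictionary rather than a network.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

server_store = {}   # chunks already held by the backup target

def source_dedup_backup(chunks):
    """Source-side dedupe: ship only chunks the target doesn't have."""
    missing = {h(c) for c in chunks if h(c) not in server_store}
    sent_bytes = 0
    for c in chunks:
        digest = h(c)
        if digest in missing and digest not in server_store:
            server_store[digest] = c   # only new data crosses the "network"
            sent_bytes += len(c)
    return sent_bytes

first = source_dedup_backup([b"a" * 100, b"b" * 100])
second = source_dedup_backup([b"a" * 100, b"b" * 100, b"c" * 100])
assert first == 200    # initial backup ships everything
assert second == 100   # next backup ships only the one new chunk
```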
Deduplication will be a feature of Windows Server 8
Data deduplication will play a pivotal role in Windows Server 8 when an administrator enables the new data deduplication role service. Once this role is enabled and configured, Windows creates scheduled tasks that run to perform the actual deduplication.
However, there are some things that should be understood about this initial deduplication effort from Microsoft:
- Windows Server 8’s dedupe is block-based.
- Deduplication through this method is not supported on the Windows boot volume or system volumes. However, it does work on storage volumes, making it particularly suitable for file servers.
- Only NTFS volumes can be deduped. FAT, FAT32 and ReFS volumes are unsupported.
- Microsoft’s deduplication does not work on compressed or encrypted volumes.
- Windows Server 8 uses a post-process deduplication method.
Deduplication plays a pivotal role in the future of storage
I wrote about this last week, but it bears repeating. A new class of storage is appearing that places data deduplication, compression and other data reduction technologies at the foundation of the solution. These solutions are being developed because solid state-based storage devices are increasingly relied upon to meet rising performance needs.
The primary problem with solid state storage is that it can’t meet growing capacity needs without new ways to extend it. That’s where a cluster of data reduction tools comes into play.
As you investigate the current storage market, look for these kinds of trends. They will necessarily change some of the metrics traditionally used by storage buyers. Now, rather than looking at just a dollar-per-TB or dollar-per-IOPS metric, buyers need to determine what kind of data reduction ratio a particular product can deliver.
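The shift in metrics amounts to one division, sketched below. The prices and ratio are made-up figures for illustration; the point is that effective cost per TB is raw cost divided by the reduction ratio the product achieves on your data.

```python
raw_cost_per_tb = 500.0    # assumed raw price in dollars per TB (illustrative)
reduction_ratio = 4.0      # e.g. 4:1 from dedupe plus compression (illustrative)

# The metric buyers should compare: effective cost per usable TB.
effective_cost_per_tb = raw_cost_per_tb / reduction_ratio
print(effective_cost_per_tb)   # prints: 125.0
```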
Always run pilots before implementing dedupe
For an organization that is considering implementing broad data deduplication technology, there is a lot to learn. Different vendors use very different deduplication methods. These different methods can impact overall storage performance, storage utilization and backup windows, among other things. Prior to deploying a dedupe solution, perform pilots with each and every vendor and put each of them through their paces. Ensure that the solution can meet your organization’s storage performance needs while still meeting the expectations that you have for the deduplication process.
Action items: Here are a few immediate steps for your consideration:
- Always run pilots. Test everything measurable to determine solution suitability.
- Consider the changing storage landscape as it relates to deduplication.
- Test Windows Server 8’s deduplication technology.
- Understand the full data lifecycle including where data deduplication can have the most benefit.
- Ensure that the deduplication technology meets business needs.