Moderator: David Vellante
Analyst: Steve Kenniston
Disk backup, which protects data by performing a backup directly to disk-based media rather than tape storage, is exploding, particularly since data deduplication technologies can be applied to eliminate redundant data within a stream. They can achieve data reduction factors of 20:1 or higher, bringing the economics of disk backup much closer to those of tape backup.
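To see how a reduction factor translates into cost, consider the quick arithmetic below. It is purely illustrative; the per-gigabyte prices and the 20:1 ratio are hypothetical placeholders, not quoted figures.

```python
# Illustrative arithmetic only: the per-GB costs below are hypothetical
# placeholders, not vendor quotes. A 20:1 reduction factor means roughly
# 20 GB of backup data lands on 1 GB of physical disk.
disk_cost_per_gb = 10.00   # assumed fully burdened disk backup cost ($/GB)
tape_cost_per_gb = 0.20    # assumed fully burdened tape backup cost ($/GB)
dedupe_ratio = 20          # assumed 20:1 data reduction

effective_disk_cost = disk_cost_per_gb / dedupe_ratio
print(f"Effective disk cost with dedupe: ${effective_disk_cost:.2f}/GB, "
      f"or {effective_disk_cost / tape_cost_per_gb:.1f}x the cost of tape")
```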
Tape backups often fail and are widely recognized as less reliable than disk-based approaches. Moreover, backup windows are increasingly tight, and recovery is often uncertain with traditional tape-based methods. Finally, the increasing popularity of server virtualization places further stress on the backup and restore process: consolidating data onto fewer physical servers makes tape backup more complicated than the simple brute-force, server-by-server approach used in purely physical environments.
At a high level, there are two predominant models for disk-based deduplication (a simplified sketch of the mechanism they share follows the list):
- Source-based deduplication (e.g., EMC/Avamar, Connected, Carbonite, Symantec, and others) uses a host-resident agent that reduces data at the source server and typically sends only changed data over the network (either locally or remotely).
- Target-based dedupe (e.g., Data Domain, Diligent, NetApp, and others) is controlled by a storage system rather than a host. This approach takes files or volumes resident on primary disk and copies them either to a cloned set of disks (which are then dumped to the backup disks) or directly to the disk-based backup target. The former is more expensive but reduces backup window pressure and minimizes application downtime; the latter is cheaper and simpler.
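Conceptually, both models rest on the same mechanism: carve the backup stream into chunks, fingerprint each chunk, and store or send only the chunks whose fingerprints have not been seen before. The Python sketch below is a simplification, not any vendor's implementation; the fixed 4 KB chunk size and SHA-256 fingerprints are assumptions chosen for clarity. In a source-based model this logic runs on the host before data crosses the network; in a target-based model it runs on the storage system after the stream arrives.

```python
import hashlib
import random

CHUNK_SIZE = 4096  # fixed-size chunking for simplicity; real products vary

def dedupe_stream(data: bytes, known_fingerprints: set) -> list:
    """Return only the chunks whose fingerprints have not been seen before."""
    new_chunks = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in known_fingerprints:
            known_fingerprints.add(fingerprint)
            new_chunks.append((fingerprint, chunk))
    return new_chunks

# A second backup of largely unchanged data yields very few new chunks.
random.seed(0)
seen = set()
first_backup = bytes(random.getrandbits(8) for _ in range(40_960))   # 10 chunks
second_backup = bytearray(first_backup)
second_backup[:CHUNK_SIZE] = bytes(random.getrandbits(8) for _ in range(CHUNK_SIZE))

print(len(dedupe_stream(first_backup, seen)))          # 10 -- everything is new
print(len(dedupe_stream(bytes(second_backup), seen)))  # 1  -- only the changed chunk moves
```

Production implementations typically use variable-size (content-defined) chunking rather than fixed blocks, so that an insertion near the start of a file does not shift every subsequent chunk boundary and defeat the matching.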
Where should customers consider each of these approaches? In general, target-based dedupe is an excellent fit for customers who want to install a virtual tape library (VTL) without substantial disruption to existing backup software infrastructure and processes. A VTL without dedupe, while convenient for recovery, is perhaps still as much as 10X the cost of tape (or 4X-5X if blending tape with integrated VTL), whereas a VTL with dedupe can take this ratio down to as low as 2-3X tape costs.
Additionally, target-based dedupe is best for higher change-rate environments (e.g. more than 3% changed data daily) and larger databases (e.g. 200GB+) with more rigorous recovery point objective (RPO) and recovery time objective (RTO) requirements. For example, direct copy to disk in this context improves RPO because this approach is able to handle higher change rates, whereas source-based dedupe will have too many changes to transmit.
Source-based data deduplication will shine in lower change-rate environments (less than 3% change daily), where customers have lots of data distributed remotely and backup today is unreliable, cumbersome, and uncertain (e.g., laptops, PCs, and remote offices). Source-based dedupe also has an advantage when transmitting over remote networks, because the data stream is reduced prior to transmission, thereby easing bandwidth constraints.
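To make the 3% rule of thumb concrete, the back-of-the-envelope calculation below estimates how long one day's changed data would take to cross a WAN link with and without source-side reduction. The dataset size, change rate, link speed, and reduction ratio are all illustrative assumptions.

```python
# Back-of-the-envelope only: dataset size, change rate, link speed, and
# dedupe ratio are illustrative assumptions, not measured figures.
dataset_gb = 500          # protected data at a remote site
daily_change_rate = 0.03  # 3% of data changes per day (the rule-of-thumb threshold)
wan_mbps = 10             # available WAN bandwidth, megabits per second
dedupe_ratio = 10         # assumed reduction on the changed data before transmission

changed_gb = dataset_gb * daily_change_rate
raw_hours = changed_gb * 8 * 1024 / wan_mbps / 3600
reduced_hours = raw_hours / dedupe_ratio

print(f"Changed data per day: {changed_gb:.0f} GB")
print(f"Transfer time, no reduction: {raw_hours:.1f} h")
print(f"Transfer time, source-side reduction: {reduced_hours:.1f} h")
```

At higher change rates or larger datasets, the unreduced transfer quickly outgrows the nightly window, which is where target-based dedupe against local disk becomes the more practical fit.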
Will data dedupe eliminate tape? A commonly asked question in the Wikibon community is: “Will data deduplication allow us to eliminate tape?” In both source- and target-based dedupe, tape involvement may be minimized or eliminated depending on the need to get data off site. There are some examples of customers using remote vaulting to disk to eliminate tape entirely. However, this approach requires the additional expense of redundant infrastructure and typically a substantial network bandwidth investment. Indeed, for most customers backing up data to disk, dumping to tape and shipping tapes off-site remains the most cost-effective and fastest way to comply with the disaster recovery edicts of the organization.
What guidelines and best practices should customers consider? Customers should start by considering RPO and RTO and understanding needs by classifying data. The extremes are relatively straightforward to address. If the application’s RPO/RTO requirement is many hours or even days, any model will work well. Go for low cost and easy recovery and even consider remotely managed services.
If RPO/RTO requirements are measured in hours or minutes, then a change in infrastructure is going to be harder to justify, and customers will likely want to leverage hardened backup and recovery processes unless they have good reasons to change. Here, internal source-based models will be a difficult sell in the organization unless it’s a database-driven source model (e.g. Oracle), where the database handles the deduping and maintains consistency between volumes and the point-in-time aspects of the process.
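One way to operationalize these guidelines is a simple decision heuristic. The function below is only a sketch; the thresholds mirror the rules of thumb above and would need tuning for a specific environment.

```python
def suggest_backup_approach(rto_hours: float, daily_change_rate: float,
                            remote_site: bool) -> str:
    """Illustrative heuristic only; thresholds mirror the guidelines above.

    rto_hours: recovery time objective in hours
    daily_change_rate: fraction of data changing per day (e.g. 0.03 = 3%)
    remote_site: True for laptops, PCs, and remote offices
    """
    if rto_hours >= 24:
        return "Any model works; optimize for cost (consider managed services)"
    if remote_site and daily_change_rate < 0.03:
        return "Source-based dedupe (reduce data before it crosses the WAN)"
    if daily_change_rate > 0.03:
        return "Target-based dedupe / VTL (handles higher change rates and tighter RPO/RTO)"
    return "Leverage existing hardened backup processes; evaluate target-based dedupe"

print(suggest_backup_approach(rto_hours=48, daily_change_rate=0.05, remote_site=False))
print(suggest_backup_approach(rto_hours=4, daily_change_rate=0.01, remote_site=True))
```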
As well, customers should always perform dedupe prior to encrypting data.
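The reason is that encryption deliberately makes identical plaintext look different each time it is encrypted, which destroys the redundancy dedupe depends on. The toy demonstration below uses a simulated keystream (not real cryptography) to show identical chunks fingerprinting identically before encryption but differently after.

```python
import hashlib
import os

def fake_encrypt(chunk: bytes) -> bytes:
    """Toy stand-in for encryption with a per-chunk random nonce; NOT real crypto."""
    nonce = os.urandom(16)
    keystream = hashlib.sha256(nonce).digest() * (len(chunk) // 32 + 1)
    return nonce + bytes(a ^ b for a, b in zip(chunk, keystream))

chunk = b"identical backup data" * 100

# Before encryption: two copies of the same chunk share one fingerprint -> dedupable.
print(hashlib.sha256(chunk).hexdigest() == hashlib.sha256(chunk).hexdigest())  # True

# After encryption: the same chunk yields different ciphertexts -> no redundancy left.
print(fake_encrypt(chunk) == fake_encrypt(chunk))  # False (different nonces)
```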
On balance, disk-based backup and data deduplication should be on every customer’s near-term planning roadmap unless the primary backup application is for data that will not demonstrate good deduplication ratios (e.g. music, movies, and mother nature).
Action Item: The choice of data deduplication applied to disk-based backup is one of how, not when. Customers should start by considering RPO and RTO requirements and assess dedupe relative to current backup methodologies to decide economically which approach is the best strategic fit. In addition to RPO and RTO, comparative metrics should include cost of recovery, operational costs and RAS (reliability, availability and scalability) of backup process.