Last week I wrote “From a backup perspective, the world is at an inflection point. Today’s requirements for growth, rapid data access and speedy recovery are outstripping the industry’s ability to solve backup challenges.” I’d like to explore this a bit further.
The basic premise of this post is that while storing data on disk in de-duplicated format is more cost effective than storing non-de-duplicated data, simply changing the target where backup data is stored offers very little other advantage to the backup process. IT practitioners, especially those aggressively pursuing virtualization strategies, have an opportunity to re-architect backup processes and dramatically reduce the I/O bottlenecks associated with backup.
The Backup Software ‘Brain’
Backup software provides two primary functions:
- To copy and move data to a secondary storage medium.
- To manage the data on the secondary storage medium.
The technology for doing so, at least in the backup software arena, has hardly changed in decades.
What does backup software do? Let’s think of the backup software as the brain that moves data from the primary storage tier to the backup storage tier. The process goes something like this: a backup job is kicked off by a ‘master’ that provides three primary functions:
- Scheduling the Backup – i.e. when do I back up?
- Cataloging – what gets backed up and where does it go?
- Policies – e.g. how long do I keep the backups?
Of course, there are other control points that the brain manages, such as administrative privileges, but these three are the core responsibilities.
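To make that concrete, here is a minimal sketch of what the brain keeps track of. All the names here (BackupPolicy, Catalog and so on) are mine, purely for illustration; no vendor’s actual API looks like this.

```python
# Hypothetical sketch of the backup "brain's" three core responsibilities.
# Names are illustrative, not any vendor's actual product or API.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class BackupPolicy:
    schedule_cron: str    # scheduling: when do I back up? e.g. "0 2 * * *" for 2am daily
    retention_days: int   # policy: how long do I keep the backups?
    target: str           # where the data goes, e.g. a tape library or VTL address

@dataclass
class Catalog:
    entries: list = field(default_factory=list)  # cataloging: what got backed up and where

    def record(self, source_path: str, target_location: str) -> None:
        self.entries.append({
            "source": source_path,
            "target": target_location,
            "backed_up_at": datetime.now(timezone.utc).isoformat(),
        })
```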
Once the backup job is initiated, a client that resides on the systems being protected walks the file system and finds the files it needs to move. When a ‘full backup’ takes place, all the files are moved. When an ‘incremental backup’ takes place, only the files flagged as changed (via the ‘archive bit’) are moved. The data is packaged into the backup software’s own proprietary tape archive format and sent to a secondary storage device. As we all know, the predominant backup medium has historically been tape (because it’s cheap), so backup software understood the language of tape devices.
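A rough sketch of that file walk is below. Real Windows clients read the archive bit directly; I’m approximating it with each file’s modification time, which isn’t identical but shows the full-versus-incremental distinction.

```python
import os

def files_to_back_up(root: str, last_backup_ts: float, full: bool) -> list[str]:
    """Walk the file system and pick what to move: everything for a full backup,
    only files changed since the last run for an incremental."""
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # approximation of the archive bit: changed since the last backup?
            if full or os.path.getmtime(path) > last_backup_ts:
                selected.append(path)
    return selected
```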
There have been advancements in the secondary medium where the data is stored, primarily disk replacing tape. Interestingly enough, the most successful disk backup solutions have been the ones that emulated tape (VTLs). The benefit of disk-based backup is that it provides performance improvements during backup and recovery. Additionally, once a backup operation is complete, it is easy to clone the data to tape for long-term, off-site storage outside of the backup window. The nice thing about only having to change out your backup target device is that none of the sunk investment in backup software, infrastructure and process has to change.
Enter Data De-duplication
I left the storage research and advisory business in 1999 to start a software company. When I came back in the middle of this decade, the technology that was most interesting and that I’d never heard of before was data de-duplication. De-dupe hit the streets back in 2000 as a ‘cool new technology’ and really began selling, from small startup vendors (Avamar, Data Domain and others), in 2003/2004. At its core, de-duplication eliminates redundant data so that backups can be stored on disk at a cost comparable to storing them on tape. There are hardware-based de-duplication solutions that receive data from the backup application and then de-duplicate it (known as target-based de-duplication). There are also software-based solutions that begin the de-duplication process on the client being protected (known as source-side de-dupe).
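If you strip away the packaging, the core idea is simple: carve the data into chunks, key each chunk by a content hash, and store each unique chunk only once. Here is a toy sketch of that idea; the fixed chunk size and in-memory store are simplifying assumptions, not any product’s design.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # fixed-size chunks for simplicity; real products often use variable-size chunking

def dedupe_store(data: bytes, store: dict) -> list:
    """Split data into chunks, key each by a content hash, and keep each unique chunk once.
    Returns the 'recipe' of hashes needed to reassemble the original data."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # a duplicate chunk costs nothing extra to store
        recipe.append(digest)
    return recipe
```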
Data Domain, the leader in data de-duplication, cleverly exploited the fact that you could deploy its solutions without any changes to existing backup software, and the market exploded. Avamar, the leader in source-side de-duplication, didn’t really begin to take off until EMC figured out that source-side de-dupe was perfect for remote offices (where you had to push backup streams over a constrained network pipe).
Enter Server Virtualization
But it wasn’t until virtualization kicked in that the folks at Avamar realized the huge potential of source-side technology: a way to reduce I/O bottlenecks in virtualized environments. This brings us back to my primary point in this post. While storing data on disk in de-duplicated format is more cost effective than storing non-de-duplicated data, simply emulating tape and changing the target leaves the backup process unchanged. That’s a huge advantage to IT people because changing processes is a pain. However, when virtualization is thrown into the mix, the backup approach may warrant new thinking.
Here’s why. One of the fundamental tenets of server virtualization is that servers are underutilized (e.g. 10-20% utilization). By virtualizing servers and sharing physical resources, you can dramatically increase efficiency and eliminate waste. But one application where server capacity is not underutilized is backup. In fact, backup is a server pig. So when you reduce your physical servers from, say, 100 to 10, you run into problems completing backups on time because your server capacity is constrained. Simply put, you need more horsepower. So practitioners will often squeeze physical-to-virtual (P-to-V) consolidation ratios in order to ensure adequate physical resources to complete backups within a window.
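The back-of-the-envelope math, with purely illustrative numbers, looks something like this:

```python
# Made-up, illustrative numbers: consolidating 100 physical servers onto 10 hosts
# leaves roughly one-tenth of the aggregate horsepower available when all those
# workloads try to run backup jobs inside the same window.
physical_before, physical_after = 100, 10
backup_horsepower_ratio = physical_after / physical_before
print(f"Backup horsepower after consolidation: {backup_horsepower_ratio:.0%} of before")  # 10%
```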
De-duplicating at the Source
This is why I’m so high on source-based de-duplication solutions in virtualized environments. They have the ability to transform the backup industry, which would be incredibly beneficial for customers. Source-based de-duplication adds a new level of intelligence to the backup client: instead of walking files, it focuses on the file system and the blocks that have changed within it, moving only the changed data (see the sketch below). This means significantly less data is moved around, which limits I/O bottlenecks and saves time and money on network and secondary storage costs.
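Here is roughly what that source-side flow looks like. The “does the target already hold this hash” check stands in for a network round trip to the backup server; it’s an illustrative assumption, not any vendor’s protocol.

```python
import hashlib

def source_side_backup(blocks, target_known_hashes):
    """Hash each block locally and ship only the blocks the target doesn't already hold."""
    to_send = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in target_known_hashes:  # only changed/unseen data crosses the wire
            to_send.append(block)
            target_known_hashes.add(digest)
    return to_send
```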
The question is whether vendors generally, and EMC specifically, will push this philosophy. Prior to the Data Domain acquisition, EMC was pursuing this approach with Avamar. But since there is so much existing backup software deployed, it might be easier for EMC to continue to replace backend devices rather than work with customers to re-architect backup. Further, as I pointed out in my last post, Data Domain concedes nothing to Avamar in VMware environments and is working aggressively to fit the target-based de-dupe square peg into the virtualization round hole.
On a related note, at a recent IBM meeting, Cindy Grossman, IBM’s head of tape and archive storage systems, said that IBM believes de-duplicating at the source is the logical approach. Obviously I agreed. IBM’s strategy with Tivoli Storage Manager and ProtecTier (Diligent) is still unclear in my mind, but I was encouraged nonetheless. And Tivoli’s incremental-forever approach holds great promise in my view as a type of source-side data reduction, despite TSM’s complexities.
A New Approach
The old way of architecting backup infrastructure is:
- Dumb client (copy/move)
- Control layer (schedule/catalogue/policy)
- Storage layer – Tier 2
A new and better way, especially for virtualized environments, may be to utilize integrated appliances with:
- Smarter client (move less data/use fewer resources)
- Integrated Control/Storage layers (fewer moving parts, easier to manage)
I know a lot of IT people who will say that changing the backup software is something they want to avoid, and I understand why. But when you’re making a shop-wide decision that involves virtualization, maybe that’s the time to consider a better way.