Storage Peer Incite: Notes from Wikibon’s June 24, 2008 Research Meeting
Moderator: Dave Vellante & Analyst: Blair Parkhill
Microsoft Exchange is the most common email technology used in U.S. businesses today. As these organizations upgrade to Exchange 2007, many are increasing the maximum size of individual mailboxes. With the increasing importance of email both in business operations and in possible legal actions, IT cannot afford not to make backup copies of these emails along with the rest of the organization's important data. However, backup windows are not expanding. Therefore, IT shops need to find ways to back up Exchange email faster.
This realization sent HP's Customer Focused Testing group on a year-long search for the optimal methods for backing up Exchange 2007 data in the shortest amount of time. On Tuesday, Blair Parkhill discussed the results of that quest with the Wikibon community during its Peer Incite meeting. That presentation triggered the articles in this week's Peer Incite newsletter. G. Berton Latamore
Best practices in Exchange 2007 backup and recovery
Anticipating that Exchange customers will increase mailbox capacity quotas when they upgrade to Exchange 2007, HP's Customer Focused Testing (CFT) Group last year initiated a project to identify the best way to reduce backup windows in Exchange 2007 environments. Because users will have more data to protect with no more time to perform backups, they will need to find ways to reliably accelerate backup processes.
The basic premise of this Peer Incite is that by taking advantage of HP’s testing efforts and leveraging its recommendations, Exchange customers can make better technology choices for their specific environments, optimize backup performance, speed implementation of new processes, cut costs, and reduce implementation risks.
The Project
HP configured Microsoft Exchange 2007 using HP ProLiant BladeServers and the HP EVA8000 as the primary Exchange data store. The HP EVA6000 was used as the disk-to-disk target array and was configured with FATA drives. An HP ESL 712e tape library was used as a dedicated backup and restore device. HP tested NTBackup as the backup application.
The EVA8000 used 146 300GB 10K rpm FC drives and the EVA6000 used 80 500GB 10K FATA drives. The ESL tape library was used as a nearline storage device and was configured with 16 HP LTO3 FC tape drives. The system tested two Exchange 2007 mailbox servers, the first hosting 5,000 users with 100MB mailboxes, and the second, 5,000 users with 1GB mailboxes.
HP ran a number of tests to determine the effect of concurrency on backup performance. Specifically, HP tested various combinations of X concurrent backup streams going to Y disks or tape drives, including 4 concurrent streams to 4 disks, 4 concurrent streams to 4 tape drives, 32 concurrent streams to 32 disks, 10 concurrent streams to 10 tape drives, and other configurations.
HP wanted to understand the core differences in the streaming API for Exchange 2003 versus Exchange 2007 and determine the degree to which Exchange 2007 enhancements could be exploited, namely the ability to support more storage groups and hence greater concurrency during backups.
Findings
What are the most effective techniques to reduce Exchange backup windows?
HP testing found that, in general, using more storage groups in Exchange 2007 results in greater backup concurrency, which should provide faster backups. However, increasing the number of storage groups can also increase complexity for storage administrators and in some cases degrade performance if server kernel memory becomes overtaxed. HP found that reducing the number of storage partitions (e.g. sending multiple streams to a single disk volume) can be an effective backup technique that neither degrades performance nor requires interleaving data, which would negatively impact restore performance.
In addition, when imposing the storage group limitations of Exchange 2003 (i.e. four storage groups), HP found that tape devices were more efficient and could achieve higher transfer rates than disk-to-disk backup. Specifically, four concurrent streams to four tape devices using hardware compression and data buffering allowed the backup of a 520GB database to be completed in 1 hour and 30 minutes versus 1 hour and 50 minutes using four streams to four disk devices.
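The two elapsed times quoted above can be turned into average transfer rates with simple arithmetic. The sketch below is only an illustration of that calculation: it assumes decimal units (1 GB = 1,000 MB) and that the full 520GB was read from the source in both runs, neither of which is stated in HP's results. The function name is made up for this example.

```python
def throughput_mb_per_s(size_gb: float, hours: int, minutes: int) -> float:
    """Average transfer rate in MB/s, assuming decimal units (1 GB = 1,000 MB)."""
    elapsed_s = hours * 3600 + minutes * 60
    return size_gb * 1000 / elapsed_s


DATABASE_GB = 520   # database size quoted in the test
STREAMS = 4         # four concurrent streams in both runs

tape = throughput_mb_per_s(DATABASE_GB, 1, 30)  # four streams to four tape drives
disk = throughput_mb_per_s(DATABASE_GB, 1, 50)  # four streams to four disk devices

if __name__ == "__main__":
    for label, rate in (("tape", tape), ("disk", disk)):
        print(f"{label}: {rate:.1f} MB/s aggregate, {rate / STREAMS:.1f} MB/s per stream")
    saved = (110 - 90) / 110  # fraction of the disk backup window saved by tape
    print(f"tape shortens the window by {saved:.0%}")
```

Under these assumptions the tape run averages roughly 96 MB/s in aggregate against roughly 79 MB/s for disk, which is the 20-minute difference expressed as a rate.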
On balance, however, increasing the number of storage groups and concurrent streams and using disk-to-disk backup yields the best performance and reliability for NTBackup. The improvements are not linear, and users should be aware that increasing the number of storage groups and concurrent streams can tax server resources and increase complexity. Backup Exec, Data Protector, and NetBackup users may find better performance with tape.
What about Volume Shadow Copy Service (VSS)?
VSS allows a point-in-time copy of open files and databases. Microsoft has improved VSS in support of Exchange 2007 and simplified its use, although users should be cautious, because VSS uses a serial method and concurrent backups can be problematic if certain criteria are not met.
VSS can offer options to users looking for continuous or near continuous backup solutions. Also, VSS can allow for passive database backup where a replica of the active database can be created and backed up on a separate set of disks, reducing contention for server and storage resources.
For applications requiring consistent point-in-time copies, VSS points to how servers, storage, and applications will interact in Exchange environments.
What are the tradeoffs between using disk- versus tape-based backup in Exchange?
The most important advantage of disk-to-disk backup is that it can be initiated with multiple streams to the same target without spreading data across multiple pieces of media. This takes advantage of disk’s random access methodology and keeps recovery times acceptable. Disk-based backup, however, introduces substantial server overhead due to file system management needs. Memory and CPU overhead can also be significant in large backup scenarios.
The main advantage of tape-based backup is that it remains the fastest technique in elapsed time. If RPO is critical, put in lots of tape drives and ignore the costs. With multiple tape drives, recovery is enhanced due to the multiple streams.
Advice for Administrators
Critical best practices that emerged from this project include:
- Larger is not necessarily better: keep Exchange database files to manageable sizes (e.g. around 25-50GB if possible) and use more storage groups and volumes.
- Back up disk array configuration data. This will accelerate restoration in the event the array needs to be replaced.
- Watch event logs. Windows event logs provide excellent visibility into activities such as log truncation and general backup health.
- Monitor server workloads to ensure that the backup job is not overtaxing server resources; keep paged pool memory below 180MB for optimal performance and efficiency.
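To make the database-size recommendation concrete, the following sketch estimates how many databases each of the two tested mailbox servers would need to stay within the 25-50GB range. It is a worst-case illustration only: it assumes every mailbox is filled to its quota, uses decimal units, and ignores database overhead; the function name is hypothetical. The resulting counts would still have to be checked against the storage group and database limits of the Exchange edition in use.

```python
def databases_needed(users: int, quota_mb: int, target_db_gb: int) -> int:
    """Smallest number of databases that keeps each one at or below target_db_gb.

    Assumes mailboxes are full to quota and spread evenly (worst case).
    """
    total_mb = users * quota_mb
    target_mb = target_db_gb * 1000      # decimal units: 1 GB = 1,000 MB
    return -(-total_mb // target_mb)     # ceiling division on integers


# The two mailbox servers described in the test configuration.
SERVERS = {
    "5,000 users x 100MB": (5000, 100),
    "5,000 users x 1GB": (5000, 1000),
}

if __name__ == "__main__":
    for label, (users, quota_mb) in SERVERS.items():
        for target_gb in (25, 50):
            count = databases_needed(users, quota_mb, target_gb)
            print(f"{label}: {count} databases at {target_gb}GB each")
```

The 1GB-mailbox case shows why larger quotas push administrators toward more storage groups and volumes, and why the memory monitoring in the last bullet matters as those counts grow.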
Action item: Exchange 2007 customers should rethink backup and restore processes and take advantage of support for increased concurrency and simplified point-in-time copy facilities. In general, the more concurrent streams made available during backup, the faster backups will perform. However, users should balance these benefits with cost and complexity of increased concurrency and storage group management.
Exchange 2007 offers an opportunity to re-assess the backup picture
One of the main themes of Exchange 2007 is unified communications, bringing email, voice, and other collaborative capabilities together in a single system. With nearly 150 million global Exchange users, Microsoft's vision of the future of email and communications is credible and will probably change the notion of what an email platform is. As more organizations adopt Exchange 2007, IT needs to re-think data protection strategies in the face of that vision.
Rather than apply today's recovery parameters to Exchange 2007, IT needs to step back and consider the big picture. As data protection evolves into next-generation solutions, it is important not to gravitate to point technologies (e.g. virtual tape, de-dupe, etc.) simply because they are hot or fit neatly into existing processes. For the past five years, process has dictated the choice of backup solution: the one that is simplest, most cost effective, and least disruptive.
At a minimum, users should start with RPO and RTO requirements that meet their evolving communications needs, and ideally organizations should more aggressively adopt information classification policies. Tiering data by RTO/RPO requirements allows the right data protection technology solution to be applied. IT is now responsible for 100% of corporate data, including the unified communications capabilities that Exchange 2007 brings. This includes emails, remote users, mobile devices, and voice. More than ever, one size doesn't fit all.
Action item: Practically speaking, the concurrency enhancements in Exchange 2007 combined with next generation backup infrastructure represent an opportunity for IT to design backup and recovery for what communications will look like in the coming years. IT needs to articulate that vision and shape it and the consequent data protection solutions around it. This will allow next-generation backup and recovery infrastructure to align more closely with the evolving needs of organizations in the coming decade.
Organize for Quicker Restores with Exchange
For some users, it might be a good idea to let the Exchange administrator handle first level backups and restores. The key advantage here is expediency. Only one organization needs to be involved when Exchange data needs to be recovered.
NTBackup is the built-in backup utility of Microsoft Windows, introduced in Windows NT around 1997 and part of all subsequent versions up to and including Windows 2000, Windows XP, and Windows Server 2003. It uses a proprietary backup format (BKF) to back up files. It also supports backing up files to tape, but not very well, and we don’t recommend it.
NTBackup from Windows XP and later includes Volume Shadow Copy Service (VSS) support (see "The chicken and egg of VSS") and thus can back up locked files. Microsoft also includes integrated VSS Requestor/Writer/Provider software for Exchange.
In Windows Vista and later operating systems, NTBackup was replaced by Windows Backup, which uses the Virtual Hard Disk (VHD) file format and supports backing up to modern media such as CDs/DVDs. For reading older backups, Microsoft has made available the NTBackup Restore utility which can only read BKF files.
Many shops have found that they can delegate responsibility for first level Exchange backups to the Exchange administrator, who creates these BKF files using NTBackup. The storage team then backs up these files using a more robust backup, recovery, and archive (BURA) infrastructure.
But NTBackup is not for everybody. Many users use third-party BURA software such as Backup Exec across the entire shop and not just for Exchange. Responsibility for this infrastructure clearly falls in the storage camp and not the Exchange administrator.
Action item: Consider delegating first level Exchange backup and restore to the Exchange administrator.
VSS: All that glisters is not gold*
In a companion alert (.pst file: The scourge of IT) I pointed out some of the significant benefits of using Volume Shadow Copy Service (VSS) as a basis for backing up Exchange databases. These include being able to take backups without disrupting service and supporting array-based snapshots to increase the number and speed of copies.
However, Microsoft’s VSS technology is young and still has a way to go to be operationally mature in large enterprises. One example is the limit on concurrency with VSS, which has been shown in HP’s tests to be approximately 8 concurrent jobs. Higher levels than this increase the probability of VSS timeouts, which usually mean backup failure, operator intervention, and restarting the backup stream. This is aggravated because VSS operates on data at the volume level. If concurrent jobs from databases or storage groups in the same partition are initiated together, this can also cause VSS snapshot creation timeouts and backup failures.
This particular limitation can be crudely overcome by offsetting the start times for VSS backups in scripts, but significant testing and operator training would be required to make this stable in a production environment.
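As a rough illustration of the offsetting idea, the sketch below groups backup jobs into waves so that no wave exceeds the approximately 8 concurrent jobs seen in HP's tests and no two jobs in a wave share a volume, then gives each wave a later start time. Everything here is hypothetical: the job and volume names, the 15-minute gap, and the assumption that one gap is long enough for the previous wave's snapshots to be created. It only computes a schedule and does not call any VSS or backup API.

```python
from datetime import datetime, timedelta

MAX_CONCURRENT_VSS_JOBS = 8  # approximate limit reported in HP's tests


def plan_waves(jobs, max_concurrent=MAX_CONCURRENT_VSS_JOBS):
    """Group (job_name, volume) pairs into waves.

    Each wave holds at most max_concurrent jobs and at most one job per volume.
    """
    waves = []
    for name, volume in jobs:
        for wave in waves:
            volumes_in_wave = {v for _, v in wave}
            if len(wave) < max_concurrent and volume not in volumes_in_wave:
                wave.append((name, volume))
                break
        else:
            # No existing wave can take this job, so open a new one.
            waves.append([(name, volume)])
    return waves


def schedule(jobs, first_start, gap):
    """Map each job name to a start time; wave i starts at first_start + i * gap."""
    plan = {}
    for i, wave in enumerate(plan_waves(jobs)):
        for name, _ in wave:
            plan[name] = first_start + i * gap
    return plan


if __name__ == "__main__":
    # Hypothetical layout: 12 storage groups spread over 5 volumes.
    jobs = [(f"SG{n:02d}", f"Vol{n % 5}") for n in range(1, 13)]
    plan = schedule(jobs, datetime(2008, 6, 24, 22, 0), timedelta(minutes=15))
    for name, start in plan.items():
        print(name, start.strftime("%H:%M"))
```

A greedy first-fit grouping like this is simple to script, but as noted above, the gap between waves would need to be validated by testing before it could be trusted in production.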
Action item: VSS is here to stay, and will be used increasingly as a key component in the Exchange eco-system. However, large installations will need to do significant operational stress testing of VSS to ensure that it works not only in normal situations but also in degraded ones. One practical strategy is to try to ensure that Exchange setups are as close as possible to Microsoft’s own internal email service implementation. Problems from Microsoft’s internal customers seem to get resolved the quickest.
For Microsoft Exchange - Backup is Important, Recovery is Everything
Microsoft Exchange has evolved from a straightforward e-mail communications tool to a full-blown mission-critical application that behaves like a database containing critical documents, images, and files. For mission-critical applications, backup is important, but recovery is everything, and Exchange is no exception. In IT organizations, much of the attention paid to critical application availability goes to the backup process: ensuring that 1) there is enough space and 2) the backup will complete in time. Unfortunately, less attention is given to the recovery process. Recovery usually means loss of access to an application for the duration of the recovery and can cost companies from $25,000 to over $4 million per hour, depending on the business. Effectively managing the Exchange recovery process is becoming increasingly important, with RPO and RTO requirements now similar to those of transaction systems.
The majority of data and storage related problems are discovered fairly quickly and, therefore, the majority of data-recovery operations begin within a relatively short time following the actual failure. On average, approximately 90 percent of all data-recovery operations occur within 24 hours after the initial problem. This means that the problem was detected and corrective actions were taken within 24 hours. Nearly 95 percent of all data recoveries are completed within one week of the problem detection, and over 99 percent of all data recoveries occur within a month. Given the widespread use and increasingly critical role of Exchange, choosing the appropriate technologies to deliver the required RTO and RPO is key to maintaining high availability levels for Exchange. Options include 1) virtual tape libraries (a disk-only solution that appears as tape), 2) integrated virtual tape libraries (robotic libraries with a disk array buffer as a front end), 3) robotic tape libraries, and 4) manually mounted tape drives. Early studies suggest that tape is faster than disk for Exchange backup and may also be faster for recovery, given the very large sizes of Exchange files, which can exceed 100 GB. Carefully planning this process in your specific environment is key.
Action item: As the critical role of Exchange grows daily, building a very fast, best-of-breed backup/recovery architecture becomes increasingly important to avoid lost business and revenue from an Exchange recovery. Users should take the time to analyze and size their typical backup streams and determine their required RTO and RPO to select the technology that best fits their needs. This analysis may take a little effort, but given the role Exchange is playing, it will be worth it.
The chicken and egg of VSS
With Exchange 2007, Microsoft has indicated that Volume Shadow Copy Service (VSS) is the future data protection methodology. VSS allows organizations to make point-in-time backups of an Exchange database (EDB) using third-party backup software. The VSS framework has evolved but by many accounts is still not robust enough for users.
The VSS framework has three main parts:
- A Requestor, which initiates and controls the creation of a copy of the EDB (typically a backup software application);
- A Writer, which does all the database housekeeping to prepare the EDB for a copy (typically the Exchange database or SQL); and,
- A Provider, which handles the actual copy function itself (typically an external storage array).
This backup ecosystem, comprised of Microsoft, third-party backup providers, and array vendors, is evolving, but too slowly. As such, many users are uncomfortable betting the farm on VSS, and this will slow the adoption of next-generation data protection approaches in Exchange 2007 environments. The industry needs to push the sophistication of VSS in general, and storage providers specifically need to understand Exchange 2007 use cases and customer requirements.
Action item: Microsoft's VSS framework represents an opportunity for backup vendors and array companies to partner with Microsoft to advance copy services and near continuous data protection. However, the industry in general and Microsoft in particular must more aggressively push the sophistication of VSS and its integration within Exchange 2007 environments. Only when users have good visibility that this framework will meet future business requirements will Microsoft's VSS mandate be substantiated by market adoption.
.pst file: The scourge of IT
Exchange administrators have always walked a tightrope between the size of the Exchange database and the length of the backup windows. The size has been kept down with often draconian limitations on the amount of space that users are allowed and the length of time that emails can be kept.
These restrictions create a problem for users: how to retain access to these emails. The result was the start of the .pst file plague, which occupies masses of hard disk space on PCs and file and print servers. In one large pharmaceutical company, about 50% of the file and print space was wasted on .pst files. Once users have created .pst files, they cannot get rid of them. This is a financial burden, a productivity killer, and a business exposure to legal discovery.
Volume Shadow Copy Service (VSS), implemented in Microsoft Windows Server 2003, allows open files to be backed up. Exchange 2007 goes further and supports array-based snapshots using VSS; this enables many backup copies to be taken during the day. In general, restoring from an array-based snapshot takes significantly less time, and because snapshots can be taken more frequently, it is easier to pick a specific time (say 2pm) to restore to. As a result, larger databases can be restored more quickly and still meet the enterprise RPO/RTO requirements for email.
Action item: IT executives should persuade the business to spend IT budget on significantly improving backup/restore capabilities for large Exchange databases with VSS and Exchange 2007. They should use these capabilities to remove user constraints on email management, and together with other solutions (e.g., email archiving) aim to obviate the need for users to keep .pst files. User productivity and risk reduction will more than justify this strategy.