On April 3, 2012, the Wikibon community held a Peer Incite to discuss the selection of data protection solutions for cloud storage. We were joined by Mike Adams, a storage specialist at Lighthouse Computer Services, who highlighted some of the key selection criteria he used in designing data protection for the company's cloud storage offering.
Without question, the use of public, private, and hybrid cloud storage offerings is on the rise. They are being used as repositories for desktop and server backups, for both active and static data archives, for file sharing, and as primary storage. Each workload and each use case may require a different storage architecture, however. With this Peer Incite’s focus on backup and disaster recovery (DR), it’s important to understand the various roles of backup.
Backup serves two primary functions in the data center: providing a local copy of data that can be used should an application or infrastructure component fail or data become corrupted, and providing a copy that is maintained off-site and can be restored at another location should the primary data center no longer be available.
The challenges of backup are numerous, but so, too, are the options. One major challenge for organizations has been the backup window. The growth of data, the growth in the number of applications, and the shrinking of the time during which production applications can be taken offline combine to make it nearly impossible for organizations to back up applications using traditional nightly, tape-based methods.
Point-in-time snapshots, particularly those that are application-aware and application-consistent, enable organizations to work around the shrinking backup window, whether the target of the backup process is traditional tape, disk, or off-premise cloud storage. In fact, some organizations have increased the frequency of snapshots and eliminated the traditional backup process altogether. The major caveat with this approach is to ensure that the snapshots are, in fact, application-consistent, as they would be with traditional backup, and not simply crash-consistent. Application consistency is critical to a simple recovery, but it does come with a trade-off: applications must be quiesced and cache buffers flushed before the snapshot is taken.
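As a rough illustration of the quiesce-flush-snapshot sequence described above, the sketch below wraps hypothetical application and volume objects; the quiesce(), flush_buffers(), resume(), and snapshot() calls are placeholders for whatever mechanism an environment actually provides (VSS writers, database hot-backup modes, or an array or hypervisor API), not any specific vendor's interface.

```python
# Minimal sketch of an application-consistent snapshot workflow.
# All method names on `app` and `volume` are hypothetical placeholders.

import contextlib

@contextlib.contextmanager
def quiesced(app):
    """Pause writes and flush cached data so the snapshot is consistent."""
    app.quiesce()           # hypothetical: ask the application to pause I/O
    app.flush_buffers()     # hypothetical: flush cache buffers to disk
    try:
        yield
    finally:
        app.resume()        # always let the application continue, even on error

def take_application_consistent_snapshot(app, volume):
    # The snapshot itself is fast; the cost is the brief quiesce period.
    with quiesced(app):
        snapshot_id = volume.snapshot()   # hypothetical array/hypervisor call
    return snapshot_id
```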
All customers of Lighthouse Computer Services are already using the Actifio Protection and Availability Storage (PAS) platform to provide on-premise data protection. Actifio stores application-consistent, point-in-time copies of production data in a deduplicated repository. This enables multiple restore points and minimizes the amount of on-site capacity required to keep multiple versions of the data. The solution uses pointers to data blocks to re-hydrate specific mount points, enabling very rapid restoration of data into a usable production data set, which can be run either directly off the Actifio PAS or transferred to higher-performance storage if the workload and service levels demand it.
With the snapshot and versioning capabilities of Actifio, it was logical for Lighthouse to then offer a service that extends a replica of the deduplicated data into Lighthouse’s cloud storage offering via an asynchronous link. The use of asynchronous replication was critical to reducing network expense. This enabled the first steps of a DR solution: frequent transfers of production datasets to an off-premise location and the affordable maintenance of multiple recovery points. Because the data is deduplicated, once the initial transfer is complete only the changed blocks need to be sent, which substantially reduces the bandwidth required to maintain the multiple restore points.
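To make the mechanics concrete, here is a minimal, illustrative sketch (not Actifio’s actual implementation) of a deduplicated repository: each point-in-time copy is just an ordered list of block hashes, re-hydration follows those pointers to rebuild a usable data set, and replication ships only the blocks the off-premise copy does not already hold.

```python
# Illustrative deduplicated block store with pointer-based re-hydration
# and changed-block-only replication. Block size and hashing scheme are
# assumptions for the sketch.

import hashlib

BLOCK_SIZE = 4096

class DedupStore:
    def __init__(self):
        self.blocks = {}      # block hash -> block bytes (each unique block stored once)
        self.snapshots = {}   # snapshot name -> ordered list of block hashes

    def ingest(self, name: str, data: bytes) -> None:
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            h = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(h, block)   # deduplicate: keep one copy per unique block
            hashes.append(h)
        self.snapshots[name] = hashes

    def rehydrate(self, name: str) -> bytes:
        """Follow the snapshot's pointers to rebuild a usable data set."""
        return b"".join(self.blocks[h] for h in self.snapshots[name])

def replicate(src: DedupStore, dst: DedupStore, name: str) -> int:
    """Ship only the blocks the target does not already hold (the 'changed blocks')."""
    missing = [h for h in src.snapshots[name] if h not in dst.blocks]
    for h in missing:
        dst.blocks[h] = src.blocks[h]
    dst.snapshots[name] = list(src.snapshots[name])
    return len(missing) * BLOCK_SIZE   # approximate bytes that crossed the wire
```

After the first replication, subsequent snapshots of the same data set share most of their blocks with earlier ones, so the returned byte count, and the bandwidth it represents, stays small relative to the full data set.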
In a disaster such as a fire that destroys the production data center, getting the data off-premise is only the first step in a DR solution. In order to restore applications, the data must be restored to the location where the applications will now be running. One way to accomplish this is to rent access to servers, either on demand or reserved, at the same site where the off-premise data is stored. The other is to export a copy, either to removable disk or tape, and physically transport the data to the recovery data center. The first alternative provides limited flexibility and is likely more costly, but it allows very rapid recovery. The second provides greater flexibility and minimizes cost, but it requires a longer time to recover, since the data must be exported and transported to a new location. While network transfer offers a third alternative that would provide a great deal of flexibility, moving large amounts of data over the network, deduplicated or not, would likely either take too long or be too costly for most organizations.
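A back-of-the-envelope calculation shows why a full network restore is usually impractical; the data set sizes, link speeds, and efficiency factor below are illustrative assumptions, not figures from the Peer Incite.

```python
# Rough estimate of how long a full restore over the network would take.

def transfer_hours(dataset_tb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Hours to move dataset_tb terabytes over a link_mbps link,
    assuming `efficiency` is the fraction of raw bandwidth actually achieved."""
    bits = dataset_tb * 8e12                         # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * efficiency)  # usable bits per second
    return seconds / 3600

for size_tb in (5, 20):
    for mbps in (100, 1000):
        print(f"{size_tb} TB over {mbps} Mbps: ~{transfer_hours(size_tb, mbps):.0f} hours")
```

Under these assumptions, even a modest 20 TB data set takes roughly 26 days over a 100 Mbps link and the better part of three days over a sustained gigabit, which is why exporting to disk or tape, or recovering on servers co-located with the cloud repository, remains attractive.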
Regardless of whether the recovery is at the same location as the cloud repository or at an alternative location, when the primary data center is eventually restored or rebuilt, the organization will need to transfer the production data back to the primary data center. This can be done either in a slow-drip fashion, using asynchronous replication over a long period, or more quickly using an export-to-disk or export-to-tape process. Once the data has been transferred or restored at the primary data center, a relatively brief resynchronization process will bring it back to production-ready status.
Action Item: Many organizations are looking to reduce or eliminate the need for tape and traditional backup processes. CIOs looking at next-generation backup solutions should at the same time re-evaluate disaster recovery options and strategies. The combination of server and desktop virtualization, application-consistent snapshots, deduplication, and asynchronous replication makes it possible to consider a cloud-based, off-premise backup of data for disaster recovery. That said, CIOs should pay particular attention to any limits on the range of supported applications and to the process for delivering the application-consistent disaster recovery data sets to the recovery site. Co-locating recovery servers and cloud-based snapshots may be a best practice, but CIOs must also consider the method by which they will fail back to the primary data center once it can again be placed in production.