Storage Peer Incite: Notes from Wikibon’s April 3, 2012 Research Meeting
On April 3 the Wikibon community gathered to discuss cloud storage as an off-site data backup and recovery option with Storage Specialist Mike Adams of Rhode Island-based Lighthouse Computer Services. Adams started the discussion by suggesting that online storage could replace tape backup. However, Wikibon CTO David Floyer took exception to this idea. Basically he argued that while online storage sounds fine, getting a large database back via online transmission can take a long time given available network transmission speeds. The fastest solution for recovery of, for instance, a large Exchange e-mail database from a remote site, is still usually to ship the backup overnight on tape, removable hard drive, or other physical media.
Lighthouse's answer to this issue is its premium service, which allows the customer to run his core applications on a Lighthouse server in the cloud until the local database can be restored. This provides a very fast -- presumably faster than overnight -- recovery. And in a major disaster such as a building fire that destroys the company data center, it allows the company to restore basic business functions while it relocates to a replacement site.
Given that one of the most common causes of failure of SMBs is a major fire that destroys their place of business as well as all paper and digital records, this is certainly an option that companies should consider, particularly if the core business is a type that could be conducted from employees' homes or temporary rented locations. It would be of less value to a retail business that cannot operate until it can reestablish itself in a permanent location and restock.
However, this still may not provide an adequate answer to the more common disaster of a hardware failure or data corruption problem that requires a reload of the latest backup. For this the best solution is still an on-site backup that can be accessed immediately. And the tradeoff in using the Lighthouse premium service is cost -- it will always be more expensive than tape backup.
The basic question then becomes whether in a disaster that destroys both the main database and the on-site backup the time required to get the latest off-site backup tape to the local restore site is significant. And the larger question is what the company will do after such a disaster. A data recovery plan for such an event only makes sense in the context of a larger, company-wide plan to restore its business processes. And for some businesses the most logical plan may be to shut down and not attempt a restoration at all. Bert Latamore, Editor
On April 3, 2012, the Wikibon community held a Peer Incite to discuss the selection of data protection solutions for cloud storage. We were joined by Mike Adams, a storage specialist at Lighthouse Computer Services, who highlighted some of the key selection criteria he used in designing data protection for the company's cloud storage offering.
Without question, the use of public, private, and hybrid cloud storage offerings is on the rise. They are being used as repositories for desktop and server backups, for both active and static data archives, for file sharing, and as primary storage. Each workload and each use case may require a different storage architecture, however. With this Peer Incite’s focus on backup and disaster recovery (DR), it’s important to understand the various roles of backup.
Backup serves two primary functions in the data center: to provide a local copy of data to be used should an application or infrastructure component fail or data become corrupted and to provide a copy of data which can be maintained off-site, and which can be restored at another location, should the primary data center no longer be available.
The challenges of backup are numerous, but so, too, are the options. One major challenge for organizations has been the backup window. The growth of data, the growth in the number of applications, and the shrinking of the time during which production applications can be taken off-line combine to make it nearly impossible for organizations to back up applications using historical nightly-backup, tape-based methods.
Point-in-time snapshots, particularly those that are application-aware and application-consistent, enable organizations to expand the backup window, whether the target of the backup process is traditional tape, disk, or off-premise cloud storage. In fact, some organizations have increased the frequency of snapshots and eliminated the backup process altogether. The major caveat with this approach is to ensure that the snapshots are, in fact, application-consistent, as they would be with traditional backup, and not simply crash-consistent snapshots. Application consistency is critical to a simple recovery, but it does come with a trade-off, since applications must be quiesced and cache buffers flushed before the snapshot is taken.
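The difference between the two kinds of snapshot can be shown with a toy model. The sketch below is only an illustration: `ToyApp` and both snapshot functions are hypothetical names invented for this example, not part of any product discussed here. It assumes an application that buffers writes in a cache before they reach disk.

```python
from dataclasses import dataclass, field


@dataclass
class ToyApp:
    """Hypothetical application with a write cache in front of its disk."""
    disk: dict = field(default_factory=dict)    # data already persisted
    cache: dict = field(default_factory=dict)   # writes not yet flushed
    accepting_writes: bool = True

    def write(self, key, value):
        if not self.accepting_writes:
            raise RuntimeError("application is quiesced")
        self.cache[key] = value

    def flush(self):
        # Move every buffered write onto the disk.
        self.disk.update(self.cache)
        self.cache.clear()


def crash_consistent_snapshot(app):
    # Copies only what is on disk right now; buffered writes are missed.
    return dict(app.disk)


def application_consistent_snapshot(app):
    # Quiesce the application, flush its cache, copy the disk, then resume.
    app.accepting_writes = False
    try:
        app.flush()
        return dict(app.disk)
    finally:
        app.accepting_writes = True


app = ToyApp()
app.write("order-1", "paid")
app.flush()
app.write("order-2", "paid")   # still sitting in the cache

crash_copy = crash_consistent_snapshot(app)
consistent_copy = application_consistent_snapshot(app)
print(sorted(crash_copy))        # ['order-1']
print(sorted(consistent_copy))   # ['order-1', 'order-2']
```

The brief pause while writes are refused is the trade-off described above: the application-consistent copy is complete, but the application has to stop accepting work while the cache is flushed.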
All customers of Lighthouse Computer Services are already using the Actifio Protection and Availability Storage (PAS) platform to provide on-premise data protection. Actifio stores application-consistent, point-in-time copies of production data in a deduplicated repository. This enables multiple restore points and minimizes the amount of on-site capacity required to keep multiple versions of the data. The solution uses pointers to data blocks to re-hydrate specific mount points. This enables very rapid restoration of data into a usable production data set, which can either be run off of the Actifio PAS or transferred to higher-performance storage if demanded by the workload and service levels.
With the snapshot and versioning capabilities of Actifio, it was logical for Lighthouse to then offer a service to extend a replica of the deduplicated data into Lighthouse’s cloud storage offering via an asynchronous link. The use of asynchronous replication was critical to reduce network expense. This enabled the first steps of a DR solution: frequent transfers of production datasets to an off-premise location and the affordable maintenance of multiple recovery points. Because the data is deduplicated, after the initial data transfer the only data that needs to be transferred is the changed blocks. This substantially reduces the bandwidth required to maintain the multiple restore points.
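To see why deduplication cuts replication traffic, consider a minimal sketch of a block store that keeps one copy of each unique block and records each restore point as a list of pointers. This is a generic illustration, not a description of how Actifio works internally: the `BlockStore` class, the `replicate` function, the fixed 4 KB block size, and the use of SHA-256 fingerprints are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def split_blocks(data: bytes, size: int = BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


class BlockStore:
    """Hypothetical deduplicated repository: one copy per unique block."""

    def __init__(self):
        self.blocks = {}      # fingerprint -> block bytes
        self.snapshots = {}   # snapshot name -> ordered list of fingerprints

    def ingest(self, name: str, data: bytes) -> int:
        """Store a restore point; return the bytes of new blocks added."""
        added = 0
        pointers = []
        for block in split_blocks(data):
            fp = fingerprint(block)
            if fp not in self.blocks:
                self.blocks[fp] = block
                added += len(block)
            pointers.append(fp)
        self.snapshots[name] = pointers
        return added

    def rehydrate(self, name: str) -> bytes:
        """Rebuild a restore point by following its block pointers."""
        return b"".join(self.blocks[fp] for fp in self.snapshots[name])


def replicate(local: BlockStore, remote: BlockStore, name: str) -> int:
    """Send only the blocks the remote lacks; return the bytes sent."""
    sent = 0
    for fp in local.snapshots[name]:
        if fp not in remote.blocks:
            remote.blocks[fp] = local.blocks[fp]
            sent += len(local.blocks[fp])
    remote.snapshots[name] = list(local.snapshots[name])
    return sent


local, remote = BlockStore(), BlockStore()

# Monday: 100 distinct blocks. Tuesday: the same data with one block changed.
monday = b"".join(bytes([i]) * BLOCK_SIZE for i in range(100))
tuesday = bytearray(monday)
tuesday[0:BLOCK_SIZE] = b"\xff" * BLOCK_SIZE
tuesday = bytes(tuesday)

local.ingest("monday", monday)
first = replicate(local, remote, "monday")
local.ingest("tuesday", tuesday)
second = replicate(local, remote, "tuesday")

print(first, second)   # 409600 4096
```

In this toy run the initial transfer moves every block, while the second restore point costs a single block of traffic, and both restore points can still be rebuilt in full at the remote site.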
Assuming a disaster such as a fire, where the production data center is destroyed, getting the data off-premise is only the first step in a DR solution. In order to restore applications, the data must be restored to the location where the applications will now be running. One way to accomplish this is to rent access to servers, either on demand or reserved, in the same site as the off-premise data is stored. The other is to export a copy, either to removable disk or tape, and physically transport the data to the recovery data center. The first alternative provides limited flexibility and is likely more costly but allows very rapid recovery. The second provides greater flexibility and minimizes cost, but requires a longer time to recover, since the data must be exported and transported to a new location. While network transfers offer a third alternative that would provide a great deal of flexibility, transferring large amounts of data over the network, whether or not deduplicated, would likely either take too long or be too costly for most organizations.
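The trade-off between restoring over the network and shipping an exported copy comes down to simple arithmetic. The sketch below is a rough estimate only; the function names, the 80% link efficiency, and every figure in the example (5 TB of data, a 100 Mbps link, 0.5 TB per hour export and import rates, 12 hours in transit) are assumptions chosen for illustration, not measurements from the Peer Incite.

```python
def network_restore_hours(data_tb: float, link_mbps: float,
                          efficiency: float = 0.8) -> float:
    """Hours to pull data_tb terabytes over a link of link_mbps megabits/s.

    efficiency is an assumed fraction of the nominal rate that is usable.
    """
    bits = data_tb * 1e12 * 8
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits / usable_bits_per_second / 3600


def shipped_restore_hours(data_tb: float, export_tb_per_hour: float,
                          transit_hours: float,
                          import_tb_per_hour: float) -> float:
    """Hours to export to removable media, ship it, and load it back."""
    return (data_tb / export_tb_per_hour
            + transit_hours
            + data_tb / import_tb_per_hour)


network = network_restore_hours(data_tb=5, link_mbps=100)
shipped = shipped_restore_hours(data_tb=5, export_tb_per_hour=0.5,
                                transit_hours=12, import_tb_per_hour=0.5)
print(f"network: {network:.1f} h, shipped: {shipped:.1f} h")
# network: 138.9 h, shipped: 32.0 h
```

With these assumed inputs the network restore takes nearly six days against roughly a day and a half for shipped media. A much faster link or a much smaller data set would reverse the result, so the inputs need to be replaced with an organization's own figures.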
Regardless of whether the recovery is at the same location as the cloud repository or in an alternative location, when the primary data center is eventually either restored or rebuilt, the organization will need to transfer the production data back to the primary data center. This can either be done in a slow-drip fashion using asynchronous replication and a long time period, or more quickly using an export-to-disk or export-to-tape process. Once the data is transferred or restored at the primary data center, a relatively brief resynchronization process will bring the data back to production-ready status.
Action item: Many organizations are looking to reduce or eliminate the need for tape and traditional backup processes. CIOs looking at next generation backup solutions should at the same time re-evaluate disaster recovery options and strategies. The combination of server and desktop virtualization, application-consistent snapshots, de-duplication, and asynchronous replication makes it possible to consider a cloud-based, off-premise backup of data for disaster recovery. That said, CIOs should pay particular attention to any limits on the range of supported applications and the process for delivering the application-consistent disaster recovery data sets to the recovery site. Co-locating recovery servers and cloud-based snapshots may be a best practice, but CIOs must also consider the method by which they will fail back to the primary data center, once it can again be placed in production.
On April 3, 2012, the Wikibon community was joined by Mike Adams, a storage specialist at Lighthouse Computer Services, who highlighted some of the key selection criteria he used in designing data protection for the company's cloud storage offering. From this discussion came a number of points of consideration for CIOs as they make their way through the backup and recovery decision process.
The data center has undergone massive transformation over the past decade. These stalwarts at the heart of the organization have morphed from racks upon racks of single workload servers with local storage to incredibly complex ecosystems consisting of highly virtualized workloads, shared storage, and critical interconnects. More recently, the rise of the cloud has created more opportunity to extend these business enabling data centers without regard for geographic location.
It’s been a truly revolutionary change in computing.
At the same time, one thing hasn’t changed: The need to back up the business data for which IT plays a stewardship role. Another item that hasn’t changed in many organizations: The use of tape as the primary backup mechanism.
With all the talk about eliminating the use of tape, one would think that every organization has finally eschewed the use of this aging backup and recovery technology. However, nothing could be further from the truth. Tape remains a common technology, particularly when it comes to long-term archive. Tape has proven to be a truly resilient technology, but the cloud is beginning to present new opportunities that might finally drive a nail into tape’s coffin.
Don’t go it alone
Although CIOs should not individually make a final decision regarding backup, as the role with primary responsibility for this critical business service, CIOs need to gain an understanding of the full spectrum of options that are available in today’s market. With a broad and deep understanding of the options and of the needs and requirements of the business, the CIO needs to develop a recommendation that considers the wide variety of options.
The CIO should then create a recommended plan of action and work with the executive team — including the CEO — to make certain that all stakeholders in the business understand and agree on the strategy. The executive team has to own the decision.
Backup is not about backup
First, never forget that backup isn’t about backup at all. It’s about recovery — from an accidentally deleted file, a database failure, a complete hardware failure, or a site disaster. Recovery means different things to different people. That list provides a spectrum of recovery.
These days, the line between backup and recovery and full-on disaster recovery has blurred to the point that disaster recovery is simply a continuation of the recovery scale. As you move from the “light” end of the recovery spectrum to the heavy, the cost to implement recovery mechanisms also increases.
Don’t think you have to maintain the status quo
Understanding that backup and recovery are a spectrum, consider the range of options that you have at your disposal, including new opportunities presented by cloud providers. Here are some of the options you might consider:
- Sticking with tape. If that meets your business needs, this is a perfectly good, and inexpensive, strategy. Don’t let vendors convince you that tape is dead. The goal is to match your backup-and-recovery mechanism to business needs, not to implement some expensive new technology just because it's there.
- Moving to local disk. Disk-based backups are a pretty hot item and for good reasons on both the backup and recovery sides of the equation. Simply put, it’s fast. Whereas tapes can be notoriously slow, disks are orders of magnitude speedier and can shrink backup windows to a fraction of their current length. One downside is that disk is ill-suited for long-term archival needs. This is why disk solutions are often coupled with other options, such as tape or cloud providers.
- Moving to the cloud. More and more companies are considering the cloud as a viable destination for protected data. DR requires geographic separation between the data center and the data backup. As bandwidth speeds increase, the cloud becomes more viable. However, downloading terabytes of backed-up data from a cloud provider can still take several days, while shipping the last tape backup from the archive location to the data center and restoring can often be done in a day or less. Oftentimes, cloud-based backup providers will load backed-up data to portable media and ship it to a customer that has experienced a disaster and needs to recover. So when considering cloud backup, be sure you understand the procedure and schedule for recovery of a full database copy.
Action item: The new technologies of the last decade are creating new backup-and-recovery options. Today, a server can be recovered in seconds using the snapshot capabilities included in every hypervisor on the market. Storage replication technology can be used to provide multiple instances of source data without administrators intervening constantly. These product-focused data protection features can be mixed and matched with a number of data protection and recovery options so that an organization can get a data protection mechanism customized to the unique needs of the business.
Simplification by converging infrastructure components and the IT organization that supports them has a significant payback. In a recent case study on extreme simplification, the business benefits of taking such an approach were shown to be a reduction of 1/3 or more of the IT budget. The converged components selected were:
- Server Virtualization using HP blades with IO virtualization,
- NetApp Storage virtualization network and snapshot backup software,
- Microsoft Hypervisor, VDI, OS and Application Suites,
- Communication network between two sites for backup, recovery and archive, using the snapshot backup software.
There are many alternative ways of skinning this cat. Cisco, EMC, HP, and others have converged infrastructure combining servers, storage, and networking. Actifio (the subject of a recent Peer Incite led by one of its resellers, Lighthouse Computer Services), Asigra, and others offer local backup appliances and cloud storage options.
A different set of boundaries was drawn in a very interesting recent Peer Incite on achieving hyper-productivity through DevOps, the combining and integration of development and operations in a single cross-trained group.
Choosing from these alternative approaches and other emerging options, and drawing integration boundaries that are suitable for an IT organization and the business it serves, is and will continue to be an extremely important process. The advice from successful early adopters is very useful, summarized from the two references above:
- Choose approaches which:
- Deliver business value,
- Excite staff,
- Utilize high levels of skill in staff;
- If tasks don't meet these three criteria, they should either not be initiated or outsourced;
- Constantly experiment with small projects:
- Create a “skunk-works” fund;
- Choose leaders that can both initiate and kill projects;
- Virtualize all parts of the IT infrastructure (servers, IO, storage, networking (as networking virtualization matures), desktops (for low-mobile users), and device virtualization (for data access for mobile multi-device users)):
- Separate software and processes from specific hardware constraints through virtualization.
- Simplify all the components and sub-components in the data center ruthlessly:
- Reduce the number of software and hardware vendors:
- Do not make the highest functionality the most important criterion in evaluating a vendor or product unless the functionality directly leads to simplification.
- Snapshot technology is an important enabler of backup, whether to disk or tape. Snapshots enable versioning and the ability to back up while applications continue to process new data.
- There is an enormous difference between crash-consistent (volume-based) backups and application-consistent backups. With crash-consistent backups, the amount of data that will have to be recovered or regenerated may be as far back as the last backup.
- Be wary of setting different service levels (RPO) for application-consistent snapshots, particularly when applications are highly interdependent in support of a business process. Ultimately, backups should be business-process aware.
- Do not trust any vendor who claims that they can get rid of tape.
- Sure, eliminate tape recovery processes that should be replaced by disk;
- Data volume is rising as fast as, if not faster than, bandwidth capabilities;
- A truck full of tapes gives an order of magnitude more bandwidth for an order of magnitude less cost, and will maintain the relative advantages into the foreseeable future;
- New tape technologies such as LTO-5 and LTFS are enabling low cost solutions for many archiving applications;
- Even with cloud solutions, tape should be part of the total solution (e.g., Amazon and Google recovery of data from tape after loss of disk copies).
- Develop clear metrics for measuring success such as:
- Reduction of the IT budget,
- Increase in the number of changes made (value of implementations or updates, with an emphasis on number),
- Percentage of changes backed out (quality of integration).
Action item: Simplification infrastructure projects need strong internal leadership and a strong interaction with the businesses that IT serves. External advice should be sought from IT services that have a track record of success and are independent of hardware and software suppliers.
At the April 3 Peer Incite the Wikibon community had the good fortune to learn about how Mike Adams, a storage specialist at Lighthouse Computer Services, designed data protection for his company's customer-facing cloud storage offering. The discussion made it clear that the organization as a whole, not just the CIO, owns the data protection decisions. Senior executives need to pay close attention when it comes to data protection and recovery, particularly as it relates to full disaster recovery.
Imagine you’re the CEO of a medium-sized business. Now, imagine that you get a phone call at 3AM from the police with a message that the fire department is at your business, which has just burned to the ground. Your first thought will be, "Is anybody hurt?" Your second will be, "How do I get operational again, and how long will that take?"
But We’ve Always Done it This Way
Traditionally, the CIO or the IT department as a whole has had primary responsibility regarding data backup and recovery decisions. Personally, I’ve seen a lot of IT departments that have sole responsibility for these critical services. At some, primary decisions regarding backup and recovery have even fallen to the administrator responsible for handling the tape backup process. It was left to this single individual to determine what could be the fate of the organization.
This has most often been true when it comes to protecting data at the local level. After all, it’s IT’s responsibility to recover quickly from what could be considered minor issues, such as accidentally deleted files, failed servers and the like.
Going back to the scenario, if you’re a CEO who has left responsibility for backup retention policies and methods to someone in IT, you may be disappointed to learn that it will take five times as long to recover from this disaster as you expected and some data will have to be pulled from last week’s backup set, because this week’s was lost in the fire.
It’s at this point when you find out that “we’ve always done it this way” is going to mean major problems for the company.
Data is Only the Beginning
As a CEO or other senior executive, never forget that your IT department is working diligently to protect your company’s data. Unfortunately, the data is just the tip of the iceberg when it comes to recovering from a disaster. I see disasters as a spectrum. At the left hand side of the graph, you have low impact “disasters” such as accidentally deleted files. These kinds of issues are generally really easy to recover from. At the right hand side of the graph, you have high impact disasters which are extremely difficult from a recovery perspective.
Really, all of this is disaster recovery. To the person who accidentally deleted a file, a disaster has occurred. That’s why I consider the full spectrum a disaster scale.
When it comes to recovery, you need to think about more than just data, however. For each successive move to the right on the disaster scale, recovery becomes more difficult. In the full-on disaster of our scenario, which is close to the right end of the scale, you first need to reimplement business processes, then find a new physical location for your office and data center, then buy and install everything from office furniture to servers, before you have anything to restore the data to. In other words, you need to consider how you’re going to resume business once you’ve recovered your data.
While your IT department will be able to master the data recovery element, it takes a group effort to ensure that business can continue as usual. Sure, you might choose to have your CIO lead this group effort, but this individual will work with a team that includes a broad array of people from across the organization. I’ve written previously about why IT governance is so important in an organization. When considering the range of options that are available for protecting data and the business, your existing governance structure may provide a ready-built framework for this collaboration.
How much insurance do you want?
Every organization needs to come to grips with a couple of key questions:
- What is your tolerance for data loss?
- What is your tolerance for business interruption?
Bear in mind that saying “None” to both of these questions will mean spending vast sums of money to put into place systems and processes that can achieve your zero data loss, zero interruption goals. Data protection and disaster recovery are insurance policies, so think of it in those terms. As you add features to the insurance policy, you also add new costs.
If you can determine how much downtime costs on an hourly basis, you can quickly determine how much insurance you want to buy to minimize interruptions of any kind.
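As a rough illustration of that calculation, the sketch below compares protection options by adding each option's annual fee to the expected yearly cost of downtime. Every number is hypothetical: the option names, service fees, recovery times, hourly downtime costs, and the assumed outage frequency of 0.05 per year (one major outage in 20 years) are placeholders, not figures from Lighthouse or any other provider.

```python
def expected_annual_cost(service_cost, recovery_hours,
                         hourly_downtime_cost, outages_per_year):
    """Annual service fee plus the expected yearly cost of downtime."""
    return service_cost + recovery_hours * hourly_downtime_cost * outages_per_year


def cheapest_option(options, hourly_downtime_cost, outages_per_year):
    """Return the name of the option with the lowest expected annual cost."""
    return min(
        options,
        key=lambda name: expected_annual_cost(
            options[name]["service_cost"],
            options[name]["recovery_hours"],
            hourly_downtime_cost,
            outages_per_year,
        ),
    )


# Hypothetical options: annual fee in dollars, hours to resume operations.
OPTIONS = {
    "off-site tape": {"service_cost": 10_000, "recovery_hours": 72},
    "cloud backup, shipped media": {"service_cost": 30_000, "recovery_hours": 36},
    "premium cloud recovery": {"service_cost": 80_000, "recovery_hours": 4},
}

for hourly_cost in (5_000, 100_000):
    choice = cheapest_option(OPTIONS, hourly_cost, outages_per_year=0.05)
    print(f"downtime at ${hourly_cost:,}/hour -> {choice}")
```

Under these assumptions a business that loses $5,000 per hour of downtime is best served by off-site tape, while one that loses $100,000 per hour should pay for the premium service. The model deliberately ignores data loss and reputational damage, which a real business impact analysis would also need to price.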
CIOs: Don’t wait… act
If you’re a CIO reading this article, here’s my advice: Engage your executive team as soon as possible and make sure that every single one of them is aware of exactly what data protection and disaster recovery mechanisms you have in place right now. Further, work with that group to determine how much insurance the company wants and start creating solution options that meet those marks.
Action item: As a CEO or other senior executive, make sure that you take an active role in understanding how the business protects data and the business itself. Working with the CIO, take steps to understand the range of options that are available. Then, if that dreaded 3AM call does take place, your second thought can be “we’ve got a plan for that”, and, even if you don’t necessarily rest easy, at least you know that the business is well taken care of.
At the April 3, 2012 Peer Incite we heard a cloud service provider, Lighthouse Computer Services, put forth a cloud-based data protection solution with a value proposition that eliminates tape. Data Domain effectively marketed the slogan "Tape Sucks!" for years, and a message that eliminates tape is alluring. But the reality is that getting rid of tape has risks that CIOs need to consider. The question is, as a vendor, how much of a responsibility do you have to convey those risks to your clients?
In the case of Lighthouse the situation is nuanced. Specifically, the data protection service being marketed, which is based on an Actifio solution, is primarily targeted at small- and mid-sized businesses. These businesses often have limited or no disaster recovery strategy and have relied for years on tape as the sole backup medium. Their tape systems are unreliable, cumbersome and expensive.
As disk-based backup has become more prevalent, for many firms, tape can be eliminated in theory, especially because many current DR plans are so lacking. In reality, however, tape or some other removable medium should be used not only as a deep archive for compliance but also as an off-site copy of last resort for disaster recovery. Without an off-site backup, SMBs are very vulnerable to fairly common disasters such as a major fire in their main office, one of the most common causes of small business failure.
As a vendor, how much should you capitalize on your client's lack of a coherent DR strategy to sell a solution versus risking slowing down the sales process to help your client really think through a DR strategy? It's not a black and white answer.
Here's the reality of the cloud. Moving data takes a long time -- several days in the case of large databases -- and if you have to move data to another location to recover from a disaster then you'd better realize it's going to take that time. The fastest, highest bandwidth and probably cheapest way to move lots of data is still to load a truck up with tapes and drive it somewhere. Old mainframers call it CTAM - the "Chevy Truck Access Method." We all like to visualize the cloud as this place where I can store huge volumes of data and move it around when I need it, but in real life speed-of-light physics and exorbitant telecommunications line costs make this impractical.
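The "Chevy Truck Access Method" can be put into numbers with a back-of-the-envelope sketch. The 1.5 TB figure is the native (uncompressed) capacity of an LTO-5 cartridge; the cartridge count and the 24-hour door-to-door time are assumptions picked for illustration, and the estimate ignores the time needed to write and read the cartridges.

```python
LTO5_NATIVE_TB = 1.5  # native (uncompressed) capacity of one LTO-5 cartridge


def shipment_bandwidth_gbps(cartridges: int, hours_door_to_door: float,
                            tb_per_cartridge: float = LTO5_NATIVE_TB) -> float:
    """Effective throughput, in gigabits per second, of shipping tapes."""
    bits = cartridges * tb_per_cartridge * 1e12 * 8
    seconds = hours_door_to_door * 3600
    return bits / seconds / 1e9


# Assumed shipment: 1,000 cartridges delivered in 24 hours.
print(f"{shipment_bandwidth_gbps(1000, 24):.1f} Gbps")   # 138.9 Gbps
```

Even this modest load works out to well over 100 Gbps of effective throughput, far more than most organizations can lease as a network link.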
So where does this leave the security manager and CIO? In the case of disaster planning, either you're going to need some type of removable medium like tape or a removable disk solution, such as that used by Ares Management, or some type of premium/enhanced DR service such as that offered by Lighthouse, where applications can access data from a remote site in a two-site, semi-active:active data center scenario. Your business objectives and budget will determine the right choice for you. As always, there's no free lunch, and "getting rid of tape" is not a business goal (especially when talking about data protection), so be careful what you wish for.
Back to the central question: What responsibility does the vendor have to convey the risks and provide a full picture to its clients? As an advocate for practitioners, the Wikibon community would say vendors must bear a large part of that responsibility. But the reality is in today's world, the buyer has to be savvy enough to ask the right questions, talk to peers, and ultimately make the right decision. This should not be a surprise to buyers -- "Let the buyer beware" is a phrase that goes back to the Roman Empire.
Nonetheless, we believe that while hyperbole may sell products in the short term, it can damage long-term relationships. The Wikibon community believes that delivering a valid use case where the desired result is possible is the vendor's responsibility, and vendors must be careful to highlight the exceptions. Without that the vendor's pipeline may fill with overselling, but customer satisfaction will ultimately decline. Overselling is particularly notable when dealing with customers’ stated goal to “get rid of tape.” A balanced approach would be for the vendor to ask good questions such as:
- Why do you want to get rid of tape?
- Have you conducted a business impact analysis?
- What is your DR plan, and what role does a removable medium play in this plan?
- Which parts of the organization are involved in creating the plan?
- How will you test your disaster recovery?
All these questions will lead to opportunities for you as a vendor to better understand your customer, provide consulting services that will reduce your client's risk, share in the success of your customer, and ultimately drive more business for your firm. Disaster recovery and data protection (backups) are insurance that customers hope never to use, but precisely because it is insurance, the standard needs to be set higher. Nothing is worse than buying insurance that can't deliver the desired result in a disaster. While we recognize the need to sell, if you're only solving one small part of the problem, you'll serve your customer best if you work with the ecosystem to communicate the full picture.
Action item: Vendors must balance the need to sell with the needs of their customer, especially when it comes to data protection. While the allure of disk-based backup solutions is very compelling, buying decisions can have ripple effects for a company's DR plan, and the vendor must bear some of the responsibility for assisting its customers in thinking through the implications of a disaster scenario. The old cliché of trusted advisor is most important when it comes to protecting data, and the sales rep who helps protect customers from true disasters will prosper in the long run.
As Scott Lowe points out in his article “C-levelers: Be an Active Participant in Your Backup and Recovery Strategy”, a cloud backup and recovery strategy in which the relevant applications and data are run from the cloud by a backup service provider such as Lighthouse Computer Services has advantages. It provides fast recovery that can be accessed from any location and that is independent of the local situation – that is, it works even if the local site is totally inaccessible or completely destroyed. The tradeoff of course is cost – these are premium services with a significant price tag.
Therefore the question comes down to the business value of the service versus its cost. And that is a CFO-level decision that can only be made in the context of the overall business and, in the case of a major disaster, its plan for business recovery.
However, cloud backup and recovery services do also provide benefits to IT, which could tip the scales if a decision hangs in the balance. It offloads emergency procedures, and associated expenses, from your data center to the service provider at a time when your staff may well be occupied with other pressing issues.
A core database failure, whether the problem is data corruption, a simple hardware failure, or a company-wide disaster such as a fire, is an “all hands on deck” emergency. Suddenly your staff is faced at minimum with a complex restart of a vital business system such as Microsoft Exchange or the company ERP system under huge pressure from the business, which needs access now! And depending on the cause, this may be complicated by the need to move the application to whatever hardware is available, which may be a smaller, slower system, until a replacement hard drive or other component can be installed. Not the best of situations.
At such moments, being able to call the service vendor and ask, “How soon can you have our database up and available in the cloud?” can move all that pressure off the staff. Now instead of an emergency that might require expensive one-off purchase and overnight shipment of new hardware and staff overtime, or more drastic and expensive action depending on the nature of the problem, your staff can take the time to do things right during normal business hours, knowing that the vital business services are already available via the cloud. The question may become whether the company ever wants to move the application back in-house at all.
Another advantage comes in testing. Too often SMBs in particular never dare test their DR plan, with the result that it fails when they really need it. CIOs who do run these tests always describe the moment when they actually pull the plug on the main system and cut over to the backup as a heart-stopper. A cloud solution also should be tested, but in this case the test just involves a call (to the service provider) and a stopwatch (to measure the actual time between that call and the moment when the application goes live in the cloud). Meanwhile the internal system can remain running normally.
Action item: When considering moving from a tape or other manual backup system to a cloud solution, definitely also consider subscribing to the service provider's premium service that can provide online access to core business applications and data via the cloud in an emergency. And while the main deciding factor in the decision must be service cost versus the cost to the business of 24 hours or more without access to the business functionality that database supports, keep in mind also that this will automate not only scheduled data backups but also disaster recovery of the systems, allowing you to eliminate those procedures, including dry-run drills, as well as the extra internal expense of responding to an emergency.