Storage Peer Incite: Notes from Wikibon’s November 17, 2009 Research Meeting
Too many companies today run huge risks with their vital data for lack of an adequate backup and disaster recovery system. This is particularly true in remote offices with no on-site IT staff. They view backup and recovery as a check-off item and rely on ancient (by IT standards) manual tape systems that are never tested. These systems are too awkward to be used for recovery of individual damaged files, which means that no one really knows if they are working until an office fire, flood, or simple disk drive failure wipes out the entire database. Then they dig out the last backup tape, put it in the reader, and hope that it works. Too often these organizations get a rude shock, and what started as a fairly routine disaster becomes an event that threatens the survival of the business.
The fact is that today there is no excuse for this kind of practice. New technologies have arrived on the market that provide reliable, automatic continuous local file backup and simple, nearly instant restore of a damaged file, with overnight backup of daily and weekly snapshots over the network to a central site. In the event of a local disaster, the full functionality of the local server can be restored over the network within hours using a spare server in the central site, maintaining the business until the local site can be restored.
Admittedly this does require a capital investment, and in this economy that can be a hard sell to the CFO. Or is it? In this newsletter Justin Bell of Strand Associates describes exactly how he sold this kind of investment to his CFO, partly on the basis of protection against disaster and partly on the value of the productivity improvements the new backup solution provides by eliminating the need to redo work lost when a file is damaged. Even in the present economy, this provides a compelling financial argument for adequate protection. G. Berton Latamore
In remote office/branch office environments, architecting backup and recovery (including DR) as a service allows an organization to improve productivity, reduce its exposure to data loss and increase flexibility by accommodating and enforcing different policy requirements across the organization.
This was the premise of the November 17, 2009 Wikibon Peer Incite Research Meeting where the community gathered with Justin Bell, network engineer at Strand Associates. Headquartered in Madison, WI., Strand is a multi-discipline engineering design firm with nearly 400 employees and 11 offices in the United States. Strand manages about 15TB of active data with five full time IT staff members.
Historically, Strand, like many organizations, treated backup in remote offices as a matter of inconvenience. The firm used direct-attached storage within its remote servers and backed up to tape. Remote office technicians and admins were responsible for performing the backups, which were done manually according to a regular schedule. Strand’s IT staff would set the policy and hope the remote office personnel would complete the jobs.
The remote office staff were disaffected by the need to manage backup and recovery, and the situation at Strand resulted in four additional undesirable outcomes:
- Only 40% of backups were successfully completed.
- Restores were a nightmare.
- Productivity of engineers and the organization dramatically declined when data was lost.
- The firm was exposed to a disaster without adequate protection.
The Need for Change
Bell set out to architect a solution to this exposure. He focused on three main business requirements:
- Recovery point objectives (RPO),
- Recovery time objectives (RTO),
- The need for automation to lower costs and improve the quality of data protection.
The current recovery point at the firm was measured in weeks and the recovery time could be days following an outage. Bell went to the CFO and explained the situation, the risks, and the costs of fixing the problem. For a firm with nearly 400 employees, the exposure was simply too high, and a roughly six-figure fix was a no-brainer.
The goal, which Strand has achieved, is to reduce the RPO from days/weeks to one hour and the RTO to minutes. Strand also wanted to ensure that 90% or more of backups succeeded, a goal it has also achieved.
Bell and his team architected a data protection service that used an iSCSI SAN in the remote offices with FalconStor’s CDP solution. The system deploys an agent on the primary server that copies data to SATA drives on the iSCSI CDP array using FalconStor’s asynchronous write-splitting technology. The CDP system takes hourly snapshots using a continuous data protection methodology, which gives Strand a one-hour RPO.
FalconStor uses its patented Microscan technology, which reads data at the sub-block level. As a result, the actual amount of changed data copied is minimal, reducing the need for data de-duplication technology.
The remote sites push the changes nightly to Strand’s Madison headquarters where 35 days of snapshots are maintained for the remote sites and one year’s worth of data is stored offsite on tape.
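The 35-day retention window described above can be illustrated with a minimal sketch. This is not FalconStor's retention logic; the function name `split_by_retention` and the sample snapshot dates are hypothetical, and only the 35-day figure comes from the text.

```python
from datetime import date, timedelta

RETENTION_DAYS = 35  # from the article: 35 days of snapshots kept at headquarters


def split_by_retention(snapshot_dates, today, retention_days=RETENTION_DAYS):
    """Split snapshot dates into those inside and outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    keep = [d for d in snapshot_dates if d > cutoff]
    expire = [d for d in snapshot_dates if d <= cutoff]
    return keep, expire


# Hypothetical example: 40 nightly snapshots ending on the meeting date.
today = date(2009, 11, 17)
nightly = [today - timedelta(days=n) for n in range(40)]
keep, expire = split_by_retention(nightly, today)
print(len(keep), "snapshots kept;", len(expire), "past the 35-day window")
```

In a real deployment the expired set would be handled by the product's own policy engine; the sketch only shows how the window boundary is computed.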
The key to this architecture according to Bell is the ability to have two complete sets of data locally and a copy of data at the Madison HQ. Tape is the fourth line of defense in this system. One result is that when a file is corrupted, it can be restored locally from the last snapshot in a few minutes, saving hours of time reconstructing work that otherwise would be lost. For DR purposes, the main office sends copies of its backups to the company's largest remote office.
The bottom line is that this approach delivers business value beyond backup, making the disaster recovery solution essentially free while solving its backup and recovery problems. And when a server is lost in a remote site, the data can be loaded onto a backup in the home office and made available over the network within an hour as a temporary solution until the server can be repaired or replaced.
Strand has performed several hundred restores since installing the system and they've all met the RPO and RTO requirements set forth in the architecture and planning phase.
The main issue Strand faced in deploying this solution was seeding the initial backup. In order to get an initial master copy of record, Strand had to do an initial full backup at each of the remote sites. This process took up to a week and a half and required some scripting using the FalconStor and system consoles to throttle up/down the initial backup depending on time of day.
Advice to Peers
Bell advises that understanding the business value is an important factor in selling the solution to the CFO. Specifically, Bell laid out the current situation, explained the risks and focused on how to minimize lost productivity during outages. By thinking through the solution, he was able to demonstrate to the CFO how a disaster recovery solution could be architected as part of the remote office backup system.
As well, Bell advises practitioners to plan for snapshots in such an architecture and assume they need roughly 50% more capacity than expected. This figure will ultimately depend on the type of data being backed up, so understanding data types is important. For example, databases will consume more capacity during snapshots.
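Bell's rule of thumb can be turned into a quick sizing calculation. The sketch below assumes the 50% figure is applied as headroom on top of the capacity you expected to need; the function name and the 2 TB office are hypothetical, and the higher headroom for a database-heavy site is an illustrative guess, not a figure from Strand.

```python
def snapshot_capacity_tb(expected_tb, headroom=0.50):
    """Capacity to provision when snapshots need `headroom` more than expected."""
    if expected_tb < 0 or headroom < 0:
        raise ValueError("expected_tb and headroom must be non-negative")
    return expected_tb * (1 + headroom)


# Hypothetical remote office that expected to need 2 TB of backup storage.
print(snapshot_capacity_tb(2.0))        # file-server data, 50% headroom

# Hypothetical database-heavy site: assume more headroom, since databases
# consume more capacity during snapshots.
print(snapshot_capacity_tb(2.0, 0.75))
```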
Finally, planning for the initial seeding is an important step. Users need to think through this activity in remote office backup environments and set expectations for personnel involved.
Action item: Ninety percent of remote office/branch office environments lack adequate data protection processes and technology, which unduly burdens remote staff, hurts organizational productivity, and exposes companies to undue risk. IT heads need to architect sensible remote office/branch data protection services that automate backup and recovery to improve quality of service, lower organizational risk, and improve productivity.
Disaster Recovery is a difficult subject to bring up with most CFOs. It's easy to lose the focus of a business person while trying to convince him to spend a large sum of money to help mitigate the effects of some type of disaster that may or may not happen sometime prior to the present investment becoming outdated. Simply put, most CFOs are more concerned about surviving the next quarter than about surviving an unpredictable and possibly unlikely disaster scenario. The solution: Preventing glazed-over gazes from C-level executives when describing a DR system is as easy as saying PRODUCTIVITY.
The new solution we proposed automated the entire backup procedure, eliminating the need for engineers at remote sites to perform backups manually and freeing their time for their primary responsibilities. At the same time, it was much more reliable than the old manual process. Prior to the project, only 40% of Strand’s backups succeeded. Subsequent to the project, Strand experienced a 90% success rate.
Decreasing the over-all impact of a disaster is the key purpose of any DR project, thus it should be the primary investment justification. The most important metrics to present are the RPO and RTO, but they need to be presented in clear business terms:
- RPO - The amount of completed work that will be lost and need to be redone.
- RTO - The amount of time it will take before our employees can start working after a disaster.
- Total Impact - RPO + RTO + Time it takes to re-do the lost work.
Typically, the man-hours lost between the disaster and the last recovery (RPO) were productive work hours that would have made the company money in some way. Additionally, there won't be any productive work completed until the systems are restored (RTO), so those hours are lost as well. The productivity cost of a disaster can then be calculated:
- Productivity Loss = RPO + RTO
Then the cost of the loss can be calculated by adding the hourly rates of all the individuals affected by the disaster:
- Cost = Productivity Loss x Total Hourly Rate of Affected Employees
Once the impact of a disaster is established in general terms, it is important to put those terms into real dollars. First, calculate the RPO of the current DR system based on a worst-case scenario. In Strand's case, there was an example where a tape drive had broken and wasn't replaced for three days. RPO was calculated from the time the last job finished (1:00 a.m. Monday) to the time the next job finished (1:00 a.m. Thursday), which represented an RPO of 72 hours, 24 of which were working hours. The RTO is the amount of time to acquire, rebuild, and deploy critical servers plus the amount of time needed to restore the associated data, which Strand calculated to be 29 hours, 8 of which were working hours. Then the cost of a disaster can be estimated in real dollars:
- Cost = (24 work hours lost + 8 work hours lost while recovering data) x Total Hourly Rate of Affected Employees
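The calculation above can be worked through in a few lines. The 72 elapsed hours, 24 lost work hours, and 8 recovery work hours come from Strand's example; the calendar dates are simply a Monday and a Thursday chosen for illustration, and the total hourly rate (50 affected employees at $40 per hour) is a hypothetical figure, since the article does not give one.

```python
from datetime import datetime


def disaster_cost(rpo_work_hours, rto_work_hours, total_hourly_rate):
    """Cost = (RPO + RTO, in working hours) x total hourly rate of affected staff."""
    productivity_loss = rpo_work_hours + rto_work_hours
    return productivity_loss * total_hourly_rate


# Elapsed RPO: last good job at 1:00 a.m. Monday, next at 1:00 a.m. Thursday.
last_job = datetime(2009, 11, 9, 1, 0)    # a Monday (illustrative date)
next_job = datetime(2009, 11, 12, 1, 0)   # the following Thursday
elapsed_rpo_hours = (next_job - last_job).total_seconds() / 3600
print(elapsed_rpo_hours)  # 72.0 elapsed hours, of which 24 were working hours

# Hypothetical: 50 affected employees at $40/hour = $2,000 per working hour.
total_hourly_rate = 50 * 40
cost = disaster_cost(rpo_work_hours=24, rto_work_hours=8,
                     total_hourly_rate=total_hourly_rate)
print(cost)  # 64000
```

Only working hours enter the cost; the elapsed figures (72 and 29 hours) matter for describing the exposure, not for pricing it.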
Using real-world examples to show why DR is important should lay the groundwork for approval of a DR project. Unfortunately, risk mitigation alone offers little incentive to the CIO who asks, "What will this project do for me now?"
In those cases you need to find and sell an added benefit: immediate productivity gains that can be realized by 'Recovery as a Service.' Strand's IT staff now uses hourly snapshots on FalconStor's CDP device to perform restores for users. This added benefit increases the efficiency of both end-users and the IT staff. If an end-user makes a mistake in a file, wishes to obtain an older version of a file, or loses her latest work to file corruption, she simply e-mails the help desk, and the IT staff can perform the restore in a matter of minutes. Prior to installing the FalconStor CDP solution, the IT staff could only provide a nightly restore point. If the end-user determined the nightly backup would be useful, a member of the IT staff had to try to locate a tape that would include the information needed, load the tape, wait for the job to complete, and hope that the tape had the correct data. Now the user simply states the nearest hourly snapshot, and a member of the IT staff mounts the snapshot and copies the requested files into live storage. The whole process takes 10 mouse clicks. Thus the answer to the question "What will this project do for me now?" is that the system will:
- Increase end-user productivity by reducing the time it takes to recover files.
- Increase end-user productivity by offering hourly snapshots and reducing the time it takes to redo lost/corrupt/erroneous work.
- Increase remote office staff productivity by reducing the time spent administering backups in lieu of dedicated IT staff.
- Increase IT productivity by reducing the time to administer backup and recovery.
That's what this project will do for you now.
Action item: Explaining the importance of RPO and RTO to the CFO is hard. The best way to sell a backup and recovery project to senior executives is to explain the current situation, identify the risks and explain the costs of mitigating that risk. In the case of Strand, the justification was a function of combining the benefits of increased productivity for recovery that occurred regularly with the mitigation of risk associated with disaster.
Vendors don’t usually mention that Continuous Data Protection (CDP) replication solutions for remote office servers and PCs that rely on network communication have a major challenge – sending the initial copy of the data to the target environment. It can take weeks, even months, to send the initial full seed copy of the data over a network. After the initial copy has been sent, incrementals can usually be sent overnight with no problems.
Strand explained in the Wikibon Peer Incite held on Nov. 17, 2009 that it took them 11 days to complete the initial data transfers. Strand looked at various options open to them to mitigate the problem:
- Create a mobile (tape or USB-based) copy of the data and ship it to the backup site.
  - This was not possible because the data was held in block mode on the target FalconStor system.
- Drive or ship the remote FalconStor appliance(s) to the central location, copy the data locally to the backup system, and drive/send the appliances back again.
  - This was deemed impractical and too risky.
- Put in a short-term high-speed network.
  - It was not possible to provide this to the Strand remote locations on a short-term basis.
Using the network was the only way, and this exposed another problem: ensuring that the backup data did not swamp the network and deny communication access to the engineers on site. Denying access would have reduced the productivity of the engineers and delayed engineering projects.
There was a function available on the remote FalconStor CDP appliance to throttle the amount of data that was transferred over the network. However, this had to be set manually. Central IT needed this process to be automatic (there was no remote IT staff to issue these commands), so the IT staff jury-rigged a script that automatically throttled back communications during the 12-hour weekday working window and allowed full speed overnight and on weekends. The good news for Strand was that a restart of the backup process was not required at any of the remote sites.
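The scheduling decision behind such a script can be sketched as follows. This is not Strand's script and does not call any FalconStor interface; the window boundaries (06:00 to 18:00), the throttled rate, and the function name are all assumptions made for illustration. The text states only that the window was 12 hours long on weekdays.

```python
from datetime import datetime

BUSINESS_START_HOUR = 6    # assumed start of the 12-hour weekday window
BUSINESS_END_HOUR = 18     # assumed end of the window (exclusive)
THROTTLED_KBPS = 256       # hypothetical daytime cap
FULL_SPEED_KBPS = None     # None means "no cap" overnight and on weekends


def replication_limit_kbps(now):
    """Return the bandwidth cap that should apply at the given local time."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_business_hours = BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR
    if is_weekday and in_business_hours:
        return THROTTLED_KBPS
    return FULL_SPEED_KBPS


print(replication_limit_kbps(datetime(2009, 11, 17, 10, 0)))  # Tuesday morning
print(replication_limit_kbps(datetime(2009, 11, 17, 22, 0)))  # Tuesday night
print(replication_limit_kbps(datetime(2009, 11, 21, 10, 0)))  # Saturday morning
```

A scheduler such as cron or Windows Task Scheduler could run a check like this periodically and pass the result to whatever throttle setting the appliance exposes.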
IT practitioners need to remember that sending mobile media such as tape or USB storage devices will often provide higher bandwidth than communication lines!
Action item: IT shops implementing remote replication software (and individuals using cloud backup software on their PCs) need to calculate exactly how long the initial seeding copy is going to take and find a work-around strategy if necessary. The simplest work-around is a USB-based or other type of transportable storage that can be used to copy the data files locally and then be shipped. IT personnel should include seeding as an important part of the evaluation process for CDP replication solutions.
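A rough seeding estimate needs only the data size and the usable link speed. The sketch below is a back-of-the-envelope calculation under stated assumptions: decimal gigabytes, a constant link rate, and an 80% efficiency factor for protocol overhead. The 500 GB office, 10 Mbps link, and 48-hour courier time are hypothetical examples, not Strand's figures.

```python
def seed_days(data_gb, link_mbps, efficiency=0.8):
    """Days needed to push data_gb over a link of link_mbps megabits per second."""
    if data_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid size, link speed, or efficiency")
    bits = data_gb * 8e9                       # decimal gigabytes to bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86400


def shipped_mbps(data_gb, transit_hours):
    """Effective throughput, in Mbps, of shipping media that takes transit_hours."""
    if transit_hours <= 0:
        raise ValueError("transit_hours must be positive")
    return data_gb * 8e9 / (transit_hours * 3600) / 1e6


# Hypothetical remote office: 500 GB to seed over a 10 Mbps link.
print(round(seed_days(500, 10), 2), "days over the network")

# The same 500 GB on a USB drive with a 48-hour courier.
print(round(shipped_mbps(500, 48), 1), "Mbps effective by courier")
```

The courier figure ignores the time to copy data onto and off the media, and the network figure ignores daytime throttling, so both should be treated as optimistic bounds.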
Strand is a multidisciplinary engineering firm headquartered in Madison, Wisc., with 11 widely distributed offices in Wisconsin, Illinois, Kentucky, Alabama, Indiana, Ohio, and Arizona. It has 12TB of live data across its offices, with another 6TB in active archives. This data has grown steadily, and Strand needed to re-engineer its time-consuming and increasingly cumbersome approach of backing up each site using individual tape backups and on-site staff resources. Statistics showed Strand had an unacceptable 40% success rate for backups completed prior to implementing a new backup architecture. Ensuring adequate backup protection and access to all of the file level, Microsoft Exchange Server, and Microsoft SQL Server data in order to meet RTOs was a major challenge for Strand's IT staff.
Strand successfully architected a highly automated solution that provides continuous data protection (CDP) for mission-critical data and handles more than 600 restores annually, and the solution works well for the firm. From a broader customer's perspective, there are several ways to view the ever-present backup problem, and Strand's experience raises further questions for any organization architecting a backup solution. Strand’s 12 TB of live data and 6 TB of archive data is a relatively small amount of storage, especially considering that the latest disk drives have 2 TB of capacity and the latest tape cartridges exceed 2 TB with compression. Their entire 6 TB archive could be contained on 3-4 tape cartridges. One way to simplify a cumbersome backup process is to reduce the total number of physical entities needed to contain this amount of data.
In addition, in the not too distant future, businesses will be presented with the option of backing up mission critical RTO sensitive data to flash disk drives using de-duplication to reduce the amount of flash needed and thus significantly lowering costs over the disk option. Flash will drastically improve recovery times (the RTO), since it is much faster than rotating disk for read operations. Flash is also more energy efficient than disk in this case. If going green is an important issue, using disk-based backup for long periods of time with minimal or no access is not as energy efficient or cost effective as some emerging technologies.
Customers also need to consider their IT geography when architecting remote office data protection. The presence of electricity in some locations is no longer a guarantee. In areas where electrical outages are increasingly common, such as along the Gulf of Mexico and Atlantic coasts and in parts of California, the question of how to provide backup and recovery services without electricity is important. Many locations don’t want removable media. However, if power availability and cost are issues, and getting your data safely out of the danger zone is a challenge, removable media allow someone to simply put the data on a truck.
Action item: Strand accomplished what it needed to do to meet its stringent requirements for remote backup. However, its environment may differ from others. There are many items to consider when building a compelling and sustainable backup strategy, including cost, energy, ease of operation, RTOs, availability, technology selection, bandwidth, geographical issues, and personnel. Having secretaries and professionals manage backup and recovery in remote offices is not acceptable in the 21st century, and given the wide selection of backup solutions, tailoring a solid solution to meet your needs is clearly achievable.
Most continuous data protection solutions save byte or block-level rather than file-level differences. This means that if you change one byte of a 100 GB file, only the changed byte or block is backed up. Traditional incremental and differential backups make copies of entire files, as does file-level backup.
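The difference between block-level and file-level change capture can be illustrated by comparing two versions of a file block by block. This is a minimal sketch, not how any particular CDP product works: real products typically intercept writes as they happen rather than hashing afterwards, and the 4 KiB block size and function names here are assumptions. The sketch also handles only files that stay the same size or grow.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size in bytes


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of the data."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Return indexes of blocks in `new` that differ from, or are absent in, `old`."""
    old_hashes = block_hashes(old, block_size)
    new_hashes = block_hashes(new, block_size)
    return [index for index, digest in enumerate(new_hashes)
            if index >= len(old_hashes) or old_hashes[index] != digest]


# A 1 MiB file in which a single byte changes.
original = bytes(1024 * 1024)
modified = bytearray(original)
modified[500_000] = 0xFF

changed = changed_blocks(original, bytes(modified))
print("changed block indexes:", changed)
print("block-level copy:", len(changed) * BLOCK_SIZE, "bytes")
print("file-level copy: ", len(modified), "bytes")
```

Here one changed byte costs one 4 KiB block at block level, versus the full 1 MiB at file level; the same ratio is what makes nightly replication of large files practical over a thin link.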
For any kind of backup or replication process, there is always a first time when the first copy of the data gets created. This must, of course, be a full copy, and the process is called seeding. For local copies this is usually not a problem. For remote copies, however, bandwidth becomes the limiting factor – it can take hours, days, or weeks. And it can interfere with other traffic on the network. In some instances, users have made the initial seed copy on tape, portable disk, or even whole appliances and then transported the copy to the central site to act as the seed.
However, using a portable copy does not typically work with block-based replication as there is no knowledge of the file system within the replicating system. Thus, while the blocks may be faithfully copied, their relation to a file system is lost, and the data is not usable. One exception is when a disk image can be made, and the target hardware is a close match to the source hardware.
Other approaches involve getting a temporary increase in bandwidth if it is available from the carrier(s), which is not usually the case.
So, users often just rely on seeding over the network and apply “intelligent” throttling to avoid impacting other traffic. QOS techniques can also be used. However, early versions of CDP products have no such intelligent throttling. They have either no provisions, crude provisions, or totally manual provisions. Sophisticated products, which usually have been in the field for quite a while, provide advanced policy-based automation for not only seeding but also for restarts, network outages, and bandwidth shortages.
Action item: Vendors should include sophisticated intelligent throttling in version 1.0 of their products.
A remote office backup and recovery strategy that relies heavily on human intervention by non-IT staff and manual procedures is a recipe for disaster if and when files and systems need to be restored – not to mention much more expensive and slower than alternative solutions that are available today. This premise was well articulated by Justin Bell, a Network Engineer working for multidisciplinary engineering firm Strand Associates, Inc. of Madison, Wisconsin during Wikibon’s November 17th, 2009 Peer Incite Research Meeting.
Bell was tasked with architecting a solution to replace Strand’s outdated legacy systems and processes that would meet or exceed their backup, recovery, and DR requirements while providing additional business value to the engineering and administrative staff.
Legacy remote office backup activities typically comprise outdated systems and processes including:
- Servers configured with local disk (directly attached),
- Tape drives that are dedicated to each server,
- Inadequate documentation,
- Inconsistent policy enforcement,
- Excessive human intervention.
Typically this process is prone to error, as humans make mistakes for a variety of reasons, particularly when performing backups is not part of their primary job and they may not have been properly trained. Recovery is an even bigger problem if the process is not highly automated, as when tapes need to be retrieved and mounted or when the recovery process needs to follow a specific sequence that business professionals are typically too busy or unwilling to perform.
Bell’s remedy was to automate the process by architecting a remote office/branch office backup and recovery service, with the software, hardware, and infrastructure for Strand's mostly Microsoft environment that allows IT staff to manage the entire process (backup and recovery) remotely. After evaluating several vendor offerings, Strand chose FalconStor’s CDP solution, which provides:
- Continuous local and remote data protection,
- Nonstop data availability,
- Storage mirroring, snapshot, replication,
- Remote management,
- No remote user intervention required.
Justification and Getting Rid of Stuff
Implementing this solution allowed Strand to avoid having to upgrade several servers and tape drives and freed remote employees to focus on their core job functions. Bell estimates that Strand was able to recoup 35% of the cost by eliminating the following items, many of which would otherwise have needed replacement to maintain the old tape-based system:
- 308 tapes in remote offices,
- 7 tape drives,
- 7 Backup Exec licenses,
- 7 backup servers.
This figure does not include the 20 to 30 hours per month of productivity returned to the workforce through the automation process or the improved accuracy, speed, and reporting capabilities now available to IT staff.
Bell has not yet opted for the deduplication features that are now available with FalconStor and suggests that snapshots are typically 50% of the size of the original data. Today, Bell estimates that Strand has 20 terabytes of storage under management and is not resource constrained. Other firms may also want to consider deduplication, compression, single instancing, and other data reduction functions that will help to slow the expansion of the overall storage footprint.
Action item: IT professionals should seriously consider replacing outmoded backup and recovery systems and processes in favor of solutions that reduce human intervention and offer additional Disaster Recovery (DR) capabilities as well as avoiding the cost of upgrading older equipment, while providing additional business value by improving worker productivity.