In remote office/branch office environments, architecting backup and recovery (including DR) as a service allows an organization to improve productivity, reduce its exposure to data loss and increase flexibility by accommodating and enforcing different policy requirements across the organization.
This was the premise of the November 17, 2009 Wikibon Peer Incite Research Meeting, where the community gathered with Justin Bell, network engineer at Strand Associates. Headquartered in Madison, WI, Strand is a multi-discipline engineering design firm with nearly 400 employees and 11 offices in the United States. Strand manages about 15TB of active data with five full-time IT staff members.
Historically, Strand, like many organizations, treated backup in remote offices as a matter of inconvenience. The firm used direct-attached storage within its remote servers and backed up to tape. Remote office technicians and admins were responsible for performing the backups, which were done manually according to a regular schedule. Strand’s IT staff would set the policy and hope the remote office personnel would complete the jobs.
The remote office staff were disaffected by the need to manage backup and recovery, and the situation at Strand resulted in four additional undesirable outcomes:
- Only 40% of backups were successfully completed.
- Restores were a nightmare.
- Productivity of engineers and the organization dramatically declined when data was lost.
- The firm was exposed to a disaster without adequate protection.
The Need for Change
Bell set out to architect a solution to this exposure. He focused on three main business requirements:
- Recovery point objectives (RPO),
- Recovery time objectives (RTO),
- The need for automation to lower costs and improve the quality of data protection.
At the time, the firm's recovery point was measured in weeks, and recovery time could be days following an outage. Bell went to the CFO and explained the situation, the risks, and the costs of fixing the problem. For a firm with nearly 400 employees, the exposure was simply too high, and a roughly six-figure fix was a no-brainer.
The goal, which Strand has achieved, is to reduce the RPO from days/weeks to one hour and the RTO to minutes. Strand also wanted to ensure that 90% or more of backups succeeded, a goal it has also achieved.
Bell and his team architected a data protection service that used an iSCSI SAN in the remote offices with FalconStor's CDP solution. The system deploys an agent on the primary server that copies data to SATA drives on the iSCSI CDP array using FalconStor's asynchronous write-splitting technology. The CDP system takes hourly snapshots using a continuous data protection methodology, which gives Strand a one-hour RPO.
FalconStor uses its patented MicroScan technology, which reads data at the sub-block level. As a result, the actual amount of changed data copied is minimal, reducing the need for data de-duplication technology.
The remote sites push the changes nightly to Strand’s Madison headquarters where 35 days of snapshots are maintained for the remote sites and one year’s worth of data is stored offsite on tape.
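The retention tiers described above can be sketched as a simple lookup that answers where a restore point of a given age would be found. This is a minimal illustration, not Strand's or FalconStor's implementation: the function name and tier labels are hypothetical, "one year" is approximated as 365 days, and local snapshots are left out because the article does not state how long they are kept on site.

```python
from typing import Optional

HQ_SNAPSHOT_DAYS = 35  # snapshot retention at Madison HQ, per the article
TAPE_DAYS = 365        # "one year's worth" on tape, approximated as 365 days (assumption)


def restore_source(age_days: int) -> Optional[str]:
    """Return the tier expected to hold a restore point of the given age.

    Tier names are illustrative labels, not product terminology.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= HQ_SNAPSHOT_DAYS:
        return "hq_snapshot"
    if age_days <= TAPE_DAYS:
        return "offsite_tape"
    return None  # older than any retained copy
```

For example, a file version from 10 days ago would come from an HQ snapshot, while one from 36 days ago would have to come from tape.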
The key to this architecture, according to Bell, is the ability to have two complete sets of data locally and a copy of the data at the Madison HQ. Tape is the fourth line of defense in this system. One result is that when a file is corrupted, it can be restored locally from the last snapshot in a few minutes, saving hours of time reconstructing work that otherwise would be lost. For DR purposes, the main office sends copies of its backups to the company's largest remote office.
The bottom line is that this approach delivers business value beyond backup, making the disaster recovery solution essentially free while solving its backup and recovery problems. And when a server is lost in a remote site, the data can be loaded onto a backup in the home office and made available over the network within an hour as a temporary solution until the server can be repaired or replaced.
Strand has performed several hundred restores since installing the system and they've all met the RPO and RTO requirements set forth in the architecture and planning phase.
The main issue Strand faced in deploying this solution was seeding the initial backup. In order to get an initial master copy of record, Strand had to do an initial full backup at each of the remote sites. This process took up to a week and a half and required some scripting using the FalconStor and system consoles to throttle up/down the initial backup depending on time of day.
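The article does not describe Strand's scripts, but the idea of throttling the seeding job by time of day can be sketched as follows. Everything here is an assumption for illustration: the business-hours window, the bandwidth figures, and the function name are hypothetical, and the step that actually applies the limit (through the FalconStor or system consoles) is deliberately omitted.

```python
from datetime import datetime, time

# Hypothetical values: the source gives neither Strand's windows nor its rates.
BUSINESS_START = time(7, 0)
BUSINESS_END = time(18, 0)
DAYTIME_LIMIT_KBPS = 256     # keep the WAN link usable for staff
OFFHOURS_LIMIT_KBPS = 4096   # let the seeding copy run faster overnight


def seeding_limit_kbps(now: datetime) -> int:
    """Return the bandwidth cap for the initial seeding copy at a given moment."""
    is_weekday = now.weekday() < 5  # Monday=0 .. Friday=4
    in_business_hours = BUSINESS_START <= now.time() < BUSINESS_END
    if is_weekday and in_business_hours:
        return DAYTIME_LIMIT_KBPS
    return OFFHOURS_LIMIT_KBPS
```

A scheduled job could call a function like this periodically and pass the result to whatever mechanism controls the replication rate.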
Advice to Peers
Bell advises that understanding the business value is an important factor in selling the solution to the CFO. Specifically, Bell laid out the current situation, explained the risks and focused on how to minimize lost productivity during outages. By thinking through the solution, he was able to demonstrate to the CFO how a disaster recovery solution could be architected as part of the remote office backup system.
Bell also advises practitioners to plan for snapshots in such an architecture and to assume they need roughly 50% more capacity than expected. This figure will ultimately depend on the type of data being backed up, so understanding data types is important. For example, databases will consume more capacity during snapshots.
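As a rough worked example of that rule of thumb, the sizing arithmetic might look like the sketch below. The function name and the sample input are hypothetical; only the 50% headroom figure comes from Bell's advice, and it should be adjusted for the data types involved.

```python
def planned_capacity_tb(expected_tb: float, headroom: float = 0.5) -> float:
    """Scale an expected capacity estimate by a snapshot headroom factor.

    headroom=0.5 reflects the "roughly 50% more" guidance; raise it for
    change-heavy data such as databases.
    """
    if expected_tb < 0 or headroom < 0:
        raise ValueError("expected_tb and headroom must be non-negative")
    return expected_tb * (1.0 + headroom)
```

With these assumptions, a site expected to need 2 TB would be planned at 3 TB.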
Finally, planning for the initial seeding is an important step. Users need to think through this activity in remote office backup environments and set expectations for personnel involved.
Action Item: Ninety percent of remote office/branch office environments lack adequate data protection processes and technology, which unduly burdens remote staff, hurts organizational productivity, and exposes companies to undue risk. IT heads need to architect sensible remote office/branch office data protection services that automate backup and recovery to improve quality of service, lower organizational risk, and improve productivity.