Failproof data backup and recovery is more critical to an organization’s survival than ever. With so much reliance on electronic data, an organization could virtually lose everything if disaster strikes, including millions of dollars associated with lost data, its competitive advantage, and even its credibility, such as in cases of security breaches. According to the U.S. Department of Labor, 93% of companies that experience a significant data loss will be out of business within five years. Implementing a failproof backup and recovery capability will protect an organization from data loss and downtime as a result of any of the following: hardware or software failure, power failure, natural disaster, or human error. There are two fundamental considerations when implementing a failproof backup and recovery capability: how quickly the organization needs to recover the data and how much data it can afford to lose. The challenge is finding the balance between data protection/recovery and the amount of investment required. This research note will provide guidelines to help make this determination.
Failproof backup and recovery capability
To set a benchmark for a failproof backup and recovery capability, this article uses a “worst-case scenario” example: an organization that can tolerate very little downtime or data loss. Starting from the most demanding case makes it easier to scale the capability to organizations with less stringent needs. In this case, a failproof backup and recovery capability is based on an organization, such as a bank, with the following characteristics:
- $1 billion in annual revenue
- 4,000 employees
- 60 branches
- 240 IT department employees
- $60 million IT budget
- 300 terabytes of data; 25% mission-critical
The key to a failproof backup and recovery capability is the ability to successfully fail over and fail back with very little data loss, failback being the more difficult to accomplish. Therefore, the capability must be carefully balanced with a combination of asynchronous and synchronous replication, performed both onsite and at a remote location. Asynchronous replication allows data to be transferred and stored at virtually any distance from the primary data site. However, placing the remote site too far from the primary site increases telecommunications costs and makes recovery much more difficult. It is best that the remote site be approximately 200 miles away: far enough to escape a disaster that strikes the primary site, but close enough that the data is easier to physically recover. The downside of asynchronous replication lies in the integrity of the data. Because of the lag between the primary and remote sites, the remote site cannot pick up exactly at the point where the primary site stopped, which creates the potential for data loss at the remote site. The benefit of synchronous replication is that it provides virtually zero data loss and easy recovery. However, it can only be conducted at a maximum distance of approximately 50 miles, which is not far enough to avoid the impact of a disaster. Combining the two approaches, and using logs and time stamps, increases the chances of a successful failback by decreasing both recovery time and data loss.
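The distance trade-off described above can be sketched as a simple site-planning check. This is only an illustration under stated assumptions: the function names and the candidate site distances are hypothetical, and the 50-mile and 200-mile figures are the approximate values quoted in the text, not hard technical limits.

```python
# Approximate distances quoted in the text (assumed here as fixed thresholds).
SYNC_MAX_MILES = 50        # rough upper bound for synchronous replication
REMOTE_TARGET_MILES = 200  # rough recommended distance for the remote site


def replication_mode(distance_miles: float) -> str:
    """Return the replication mode the text associates with a site distance."""
    if distance_miles < 0:
        raise ValueError("distance must be non-negative")
    if distance_miles <= SYNC_MAX_MILES:
        return "synchronous"
    return "asynchronous"


def describe_site(name: str, distance_miles: float) -> str:
    """Summarize one candidate site against the guidance in the text."""
    mode = replication_mode(distance_miles)
    if distance_miles >= REMOTE_TARGET_MILES:
        role = "far enough to serve as the remote disaster site"
    else:
        role = "likely within reach of a disaster at the primary site"
    return f"{name}: {distance_miles} miles, {mode} replication, {role}"


if __name__ == "__main__":
    # Hypothetical candidate sites, for illustration only.
    for site, miles in [("metro site", 30), ("regional site", 200)]:
        print(describe_site(site, miles))
```

In this sketch the nearby site carries the synchronous copy and the distant site carries the asynchronous copy, which mirrors the combined approach the paragraph recommends.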
Specific operational goals of implementing failproof backup and recovery
Successful implementation of a failproof backup and recovery capability will: 1) minimize interruptions to normal operations; 2) limit the extent of data loss; 3) avoid security breaches; 4) minimize the financial impact of the interruption; 5) establish alternative means of operation in advance; 6) provide efficient and timely restoration of operations; 7) ensure that the capability evolves to meet the organization’s growth; and 8) comply with industry regulations.
The likely investment to implement a failproof backup and recovery capability as described above is approximately 20-25% in capital expenditures of the annual IT budget, or $6 million to $7 million. Yearly maintenance is approximately 5-10% of the budget, or $320,000 to $400,000. The time to create a failproof backup and recovery capability is approximately 3 months of planning and design and 3-6 months to implement. Additional goals include:
- Agreeing on the maximum sustainable loss for the organization, that is, the amount of money that the organization can lose in a disaster and still survive
- Ensuring that the maximum loss of a disaster is lower than the maximum sustainable loss (the maximum loss is some multiple (2-3x) of the actual loss in a disaster)
- Ensuring that the expected loss in a disaster can be sustained and is less than 1/3 of the maximum sustainable loss
- Ensuring that the investment in failproof disaster recovery is sufficient to reduce the maximum loss below the maximum sustainable loss
- Ensuring that investment in any further reduction in expected loss is cost justified (meets the return on investment criteria of the organization)
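The loss conditions in the goals above can be expressed as a small check. The following is a minimal sketch, not a prescribed method: the function name, the result keys, and the sample figures are hypothetical, and the default multiplier of 2 is simply the low end of the 2-3x range mentioned above.

```python
def check_loss_targets(actual_loss, expected_loss, max_sustainable_loss,
                       multiplier=2.0):
    """Check the two loss conditions from the goals list.

    All amounts must use the same unit (for example, millions of dollars).
    `multiplier` scales the actual loss into the maximum loss (2-3x in the text).
    """
    max_loss = actual_loss * multiplier
    return {
        "max_loss": max_loss,
        # Maximum loss must be lower than the maximum sustainable loss.
        "max_loss_ok": max_loss < max_sustainable_loss,
        # Expected loss must be less than 1/3 of the maximum sustainable loss.
        "expected_loss_ok": expected_loss < max_sustainable_loss / 3,
    }


if __name__ == "__main__":
    # Hypothetical figures in millions of dollars, for illustration only.
    print(check_loss_targets(actual_loss=100, expected_loss=5,
                             max_sustainable_loss=250))
    print(check_loss_targets(actual_loss=100, expected_loss=5,
                             max_sustainable_loss=250, multiplier=3.0))
```

With these sample figures the organization passes at a multiplier of 2 but fails the maximum-loss condition at a multiplier of 3, which shows how sensitive the result is to the multiplier chosen.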
Risks of implementing failproof backup and recovery
The associated risks of implementing a failproof backup and recovery strategy are:
- Not fully understanding the financial losses due to data loss and either overspending or not implementing an adequate capability.
- The remote site is not far enough from the primary site to avoid impact of a disaster.
- The remote site is not close enough to the primary site to physically and efficiently recover the data.
- The primary site resumes after an outage before the remote site is restored. This leaves a larger gap in replication, which increases the chance of data loss.
- Corrupted data on the primary site is replicated on the remote site leaving both systems corrupted.
- Inconsistent testing of the capability to ensure it is working.
- Losing data while testing the capability.
- Inadequate documentation of system configurations and contents of backup tapes.
- Inadequate procedures to expeditiously rotate data off site.
- Inadequate testing procedures that test failover but not failback, giving management a false sense of security that the system is truly failproof.
The failproof backup and recovery initiative
Expectations (Out-of-scope)
For a failproof backup and recovery initiative to succeed, the following factors need to be in place:
- A full understanding of how the organization’s operations could continue at some level should data be inaccessible, and what is the lowest level of operations.
- A comprehensive understanding of all the possible scenarios in which data could be lost or temporarily inaccessible.
- Insight into the growth plans for the organization and how those plans will affect the data. For example, are there plans to open new facilities, acquire or merge with other organizations, increase staffing, or sell portions of the company?
Analyze phase
This phase includes an examination of all data and how loss of various data or downtime will impact operations. The analyze phase also includes determining the key disaster parameters based on business value to better understand the required configuration of the capability.
Acceptance Test Considerations
The analysis phase will be completed when the sponsor fully understands the parameters, goals, risks and requirements of the capability.
Key analysis milestones
This phase should take about 6-12 weeks and 30-60 person-days of effort.
1. An effective sponsor of the initiative is identified - It is important that the sponsor can resolve any organizational issues and is familiar with risk metrics and methodologies
2. Data collected - Determine the key disaster recovery parameters. To size the failproof backup and recovery capability, an organization needs to understand the acceptable downtime in case of a disruption of operations, that is, the latest point in time at which business operations must resume after a disaster. This is known as the “Recovery Time Objective” (RTO). It is used in conjunction with the “Recovery Point Objective” (RPO), which is the point in time to which data must be restored in order to successfully resume processing. The RPO corresponds to the time between the last backup and the moment the outage occurred, and so indicates the amount of data lost. Using the example organization described above, here are the metrics used to understand the probable loss and help set the RTO and RPO for the capability:
a. Chance of disaster (based on insurer data): 5% annually
b. Cost of downtime per hour: $3 million
c. Cost of data loss per hour: $5 million
d. Cost of downtime for 96 hours (RTO) at remote site: $288 million (RTO x b)
e. Cost of data loss for 12 hours (RPO) onsite: $60 million (RPO x c)
f. Expected loss: $17.4 million ((d+e) x a)
g. Maximum loss is 2 x loss ((d + e) x 2): $696 million
It is highly unlikely that a maximum loss of roughly $700 million could be sustained by an organization with $1 billion in revenue. Action must be taken to mitigate such an exposure, and government regulatory bodies (the SEC in the case of a bank) would exert great pressure to ensure that the risks to investors and customers are reduced.
3. Data Analyzed
- RTO and RPO targets can then be adjusted to ensure that the maximum sustainable loss is greater than the maximum loss.
4. Business case constructed:
- Analyst constructs business case / cost-benefit analysis detailing alternative scenarios
- Recommend the best alternative to the business
5. Initial Design and business case accepted by sponsor and any other stakeholders necessary
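The loss figures in milestone 2 and the target adjustment in milestone 3 can be reproduced with a short calculation. This is a minimal sketch rather than a prescribed model: the function names are hypothetical, the rates and multiplier are the example values given above, and the candidate RTO/RPO pairs and the $200 million maximum sustainable loss are assumptions made purely for illustration.

```python
# Example values from the analysis milestones; amounts in millions of dollars.
ANNUAL_DISASTER_PROBABILITY = 0.05  # a. chance of disaster per year
DOWNTIME_COST_PER_HOUR = 3          # b. cost of downtime per hour
DATA_LOSS_COST_PER_HOUR = 5         # c. cost of data loss per hour
MAX_LOSS_MULTIPLIER = 2             # multiplier used in item g


def loss_profile(rto_hours, rpo_hours):
    """Compute the loss figures for a given RTO and RPO (both in hours)."""
    downtime_cost = rto_hours * DOWNTIME_COST_PER_HOUR      # d. RTO x b
    data_loss_cost = rpo_hours * DATA_LOSS_COST_PER_HOUR    # e. RPO x c
    actual_loss = downtime_cost + data_loss_cost            # d + e
    return {
        "downtime_cost": downtime_cost,
        "data_loss_cost": data_loss_cost,
        "expected_loss": actual_loss * ANNUAL_DISASTER_PROBABILITY,  # f
        "max_loss": actual_loss * MAX_LOSS_MULTIPLIER,               # g
    }


def acceptable_targets(candidates, max_sustainable_loss):
    """Keep the (RTO, RPO) pairs whose maximum loss is below the limit."""
    return [
        (rto, rpo)
        for rto, rpo in candidates
        if loss_profile(rto, rpo)["max_loss"] < max_sustainable_loss
    ]


if __name__ == "__main__":
    # Baseline from the text: RTO of 96 hours, RPO of 12 hours.
    print(loss_profile(96, 12))

    # Hypothetical candidate targets and sustainable-loss limit.
    candidates = [(96, 12), (48, 6), (24, 4), (8, 1)]
    print(acceptable_targets(candidates, max_sustainable_loss=200))
```

Tightening the RTO and RPO lowers the maximum loss linearly in this model, so the business case then has to weigh each tighter pair against the cost of the technology needed to achieve it.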
Design phase
Acceptance test considerations
The design phase is complete once the stakeholders agree that the design plan will meet the goals and requirements of the capability, the RFP has been issued and the key hardware, software and telecom vendors are selected.
Key design milestones
This phase should take about 10-14 weeks and about 60 person-days of effort.
1. Primary vendor decided
Decide on vendor hardware and software technologies available and issue RFP/solicit bids
- EMC, HDS, HP, & IBM would be the primary vendors to consider
- Determine telecommunication requirements and issue RFP/solicit bids
2. Disaster recovery procedures designed
- Design procedures around the selected hardware and software, and integrate them with current procedures
- Pay particular attention to the criteria for failing over to the remote site
- Determine training requirements for operations
- Design test procedures and scripts
Deploy phase
Acceptance Test Considerations
The deploy phase will be completed when:
- The backup and recovery capability is built, tested, and brought into service
- A failover and failback test is run within 12 months of cutover
- The operations group is fully responsible for all aspects of the installation
Key deployment milestones
This phase should take about 4-6 months and cost between $1.5 million & $8.0 million.
1. The backup and recovery capability is built
- Installation of storage hardware and storage management functionality
- Installation of the telecommunication facilities
- Update of existing processes and procedures and creation of new ones, with full documentation
2. The backup and recovery capability tested
- Testing of equipment, software, and procedures on historical data
- Testing of recovery on some non-mission critical live applications
- Testing of procedures for migration, backup, recovery, and disaster recovery
3. Migration & Cutover completed
- Phased migration cutover
- Extensive monitoring of performance, reliability, & telecommunication performance
4. The backup and recovery capability initiative wrapped up
- Procedures set up for monitoring performance and implementing yearly testing (including yearly fail-over and fail-back testing)
- Procedures set up for adding additional storage, storage functionality, and telecommunications bandwidth
- Final review of documentation
- All project staff released and full hand-over to storage operations
Initiative summary
As noted, a failproof backup and recovery capability will vary depending on the nature of the organization’s data and how much risk and financial loss the organization can tolerate. In any case, how the capability is implemented and managed is crucial to the health of the business. Management needs to make it a top priority and a critical management objective. This means treating it as part of the overall system, consistently tested, examined, and updated in the same way as other business operations.