Implementing Fail Proof Backup And Recovery

Last Update: Jan 28, 2009 | 09:07

Originating Author: David Vellante

Failproof data backup and recovery is more critical to an organization’s survival than ever. With so much reliance on electronic data, an organization could lose virtually everything if disaster strikes, including millions of dollars associated with lost data, its competitive advantage, and even its credibility, as in cases of security breaches. According to the U.S. Department of Labor, 93% of companies that experience a significant data loss will be out of business within five years. Implementing a failproof backup and recovery capability protects an organization from data loss and downtime resulting from hardware or software failure, power failure, natural disaster, or human error. There are two fundamental considerations when implementing such a capability: how quickly the organization needs to recover its data and how much data it can afford to lose. The challenge is finding the balance between the level of data protection and recovery and the investment required. This research note provides guidelines to help make this determination.

1 Fail proof backup and recovery capability
2 Specific operational goals of implementing failproof backup and recovery
3 Risks of implementing fail proof backup and recovery
4 The fail proof backup and recovery initiative

Fail proof backup and recovery capability

To set a benchmark for a failproof backup and recovery capability, this article uses a “worst-case scenario” example, meaning an organization that can tolerate very little downtime or data loss. This provides a baseline that can easily be adjusted for other organizations. In this case, a failproof backup and recovery capability is based on an organization, such as a bank, with the following characteristics:

- $1 billion in annual revenue

- 4,000 employees

- 60 branches

- 240 IT department employees

- $60 million IT budget

- 300 terabytes of data; 25% mission-critical

The key to a failproof backup and recovery capability is the ability to fail over and fail back successfully with very little data loss, failback being the more difficult of the two to accomplish. The capability must therefore carefully balance asynchronous and synchronous replication, performed both onsite and at a remote location. Asynchronous replication allows data to be transferred and stored at virtually any distance from the primary data site. However, placing the remote site too far from the primary site increases telecommunications costs and makes recovery much more difficult. The remote site is best located approximately 200 miles away: far enough to escape the impact of a disaster that strikes the primary site, but close enough that the data is easier to physically recover. The downside of asynchronous replication is data integrity. Because of the lag between the primary and remote sites, the remote site cannot pick up at exactly the point where the primary site stopped, which creates the potential for data loss at the remote site. Synchronous replication, by contrast, provides virtually zero data loss and easy recovery, but it can only be conducted over a maximum distance of approximately 50 miles, which is not far enough to avoid the impact of a disaster. Combining the two approaches, together with logs and time stamps, increases the chances of a successful failback by decreasing both recovery time and data loss.
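The distance rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the 50-mile and 200-mile figures come from the text, while the function name, the site names, and the two-site layout are assumptions made for the example.

```python
# Distance thresholds taken from the text; treat them as rough rules of thumb.
SYNC_MAX_MILES = 50         # approximate ceiling for synchronous replication
REMOTE_TARGET_MILES = 200   # suggested distance for the asynchronous remote site


def replication_mode(distance_miles: float) -> str:
    """Return the replication mode the rule of thumb suggests for a site."""
    if distance_miles < 0:
        raise ValueError("distance must be non-negative")
    # Within roughly 50 miles, synchronous replication is feasible
    # and gives virtually zero data loss.
    if distance_miles <= SYNC_MAX_MILES:
        return "synchronous"
    # Beyond that, asynchronous replication is used, at the cost of some lag.
    return "asynchronous"


# Hypothetical layout: one nearby site plus one remote site.
sites = {"metro": 30, "remote": REMOTE_TARGET_MILES}
plan = {name: replication_mode(miles) for name, miles in sites.items()}
print(plan)
```

A layout like this pairs a nearby synchronous copy for near-zero data loss with a distant asynchronous copy for disaster survival, which is the combination the paragraph recommends.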

Specific operational goals of implementing failproof backup and recovery

Successful implementation of a failproof backup and recovery capability will: 1) minimize interruptions to normal operations; 2) limit the extent of data loss; 3) avoid security breaches; 4) minimize the financial impact of the interruption; 5) establish alternative means of operation in advance; 6) provide efficient and timely restoration of operations; 7) ensure that the capability evolves to meet the organization’s growth; and 8) comply with industry regulations.

The likely investment to implement a failproof backup and recovery capability as described above is approximately 20-25% of the annual IT budget in capital expenditures, or $6 million to $7 million. Yearly maintenance is approximately 5-10% of the budget, or $320,000 to $400,000. Creating the capability takes approximately 3 months of planning and design and 3-6 months of implementation. Additional goals include:

- Agreeing on the maximum sustainable loss for the organization, that is, the amount of money the organization can lose in a disaster and still survive

- Ensuring that the maximum loss of a disaster is lower than the maximum sustainable loss (the maximum loss is some multiple, 2-3x, of the actual loss in a disaster)

- Ensuring that the expected loss in a disaster can be sustained and is less than 1/3 of the maximum sustainable loss

- Ensuring that the investment in failproof disaster recovery is sufficient to reduce the maximum loss below the maximum sustainable loss

- Ensuring that investment in any further reduction in expected loss is cost justified (meets the return on investment criteria of the organization)
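The loss goals above reduce to two comparisons, which the following Python sketch makes explicit. It assumes that "expected loss in a disaster" means the estimated actual loss if a disaster occurs. The function name and the dollar figures in the usage lines are hypothetical and are not taken from the article.

```python
def check_loss_goals(loss_in_disaster, max_sustainable_loss, multiple=2.0):
    """Check the two loss goals against a maximum sustainable loss.

    loss_in_disaster:     estimated actual loss if a disaster occurs
    max_sustainable_loss: the most the organization can lose and still survive
    multiple:             factor applied to the actual loss to obtain the
                          maximum loss (the text suggests 2-3x)
    """
    max_loss = multiple * loss_in_disaster
    return {
        "max_loss": max_loss,
        # Goal: maximum loss is lower than the maximum sustainable loss.
        "max_loss_below_sustainable": max_loss < max_sustainable_loss,
        # Goal: loss in a disaster is less than 1/3 of the maximum sustainable loss.
        "loss_below_one_third": loss_in_disaster < max_sustainable_loss / 3,
    }


# Hypothetical figures in dollars, chosen only to exercise the checks.
result = check_loss_goals(loss_in_disaster=40e6,
                          max_sustainable_loss=150e6,
                          multiple=3)
print(result)
```

If either check fails, the remaining goals in the list apply: further investment is needed until the maximum loss falls below the maximum sustainable loss.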

Risks of implementing fail proof backup and recovery

The associated risks of implementing a failproof backup and recovery strategy are:

- Not fully understanding the financial losses due to data loss and either overspending or not implementing an adequate capability.

- The remote site is not far enough from the primary site to avoid impact of a disaster.

- The remote site is not close enough to the primary site to physically and efficiently recover the data.

- The primary site resumes after an outage before the remote site is restored. This leaves a larger replication gap, which increases the chance of data loss.

- Corrupted data on the primary site is replicated on the remote site leaving both systems corrupted.

- Inconsistent testing of the capability to ensure it is working.

- Losing data while testing the capability.

- Inadequate documentation of system configurations and contents of backup tapes.

- Inadequate procedures to expeditiously rotate data off site.

- Inadequate testing procedures that test failover but not failback, giving management a false sense of security that the system is truly fail proof.

The fail proof backup and recovery initiative

Expectations (Out-of-scope)

For a failproof backup and recovery initiative to succeed, the following factors need to be in place:

- A full understanding of how the organization’s operations could continue at some level should data be inaccessible, and what the lowest level of operations is.

- A comprehensive understanding of all the possible scenarios in which data could be lost or temporarily inaccessible.

- Insight into the growth plans for the organization and how those plans will affect the data. For example, are there plans to open new facilities, acquire or merge with other organizations, increase staffing, or sell portions of the company?

Analyze phase

This phase includes an examination of all data and how loss of various data or downtime will impact operations. The analyze phase also includes determining the key disaster parameters based on business value to better understand the required configuration of the capability.

Acceptance Test Considerations

The analysis phase will be completed when the sponsor fully understands the parameters, goals, risks and requirements of the capability.

Key analysis milestones

This phase should take about 6-12 weeks and 30-60 person days of effort.

1. An effective sponsor of the initiative is identified - It is important that the sponsor can resolve any organizational issues and is familiar with risk metrics and methodologies.

2. Data collected - Determine the key disaster recovery parameters. As mentioned, to size the failproof backup and recovery capability, an organization needs to understand the acceptable downtime if operations are disrupted, meaning the latest point in time at which business operations must resume after a disaster. This is known as the “Recovery Time Objective” (RTO). It is used in conjunction with the “Recovery Point Objective” (RPO), the point in time to which data must be restored in order to successfully resume processing. The RPO corresponds to the time between the last backup and the outage, and so indicates the amount of data lost. Using the example organization described above, here are the metrics used to understand the probable loss and help set the RTO and RPO for the capability:

a. Chance of disaster (based on insurer data): 5% annually

b. Cost of downtime per hour: $3 million

c. Cost of data loss per hour: $5 million

d. Cost of downtime for 96 hours (RTO) at remote site: $288 million (RTO x b)

e. Cost of data loss for 12 hours (RPO) onsite: $60 million (RPO x c)

f. Expected loss: $17.4 million ((d+e) x a)

g. Maximum loss is 2 x loss ((d + e) x 2): $696 million

It is highly unlikely that a maximum loss of about $700 million could be sustained by an organization with $1 billion in annual revenue. Action must be taken to mitigate such an exposure, and government regulatory bodies (the SEC in the case of a bank) would apply great pressure to ensure that the risks to investors and customers were reduced.
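The arithmetic behind items a through g can be reproduced with a short Python sketch. All input figures are the ones given above, in millions of dollars; the variable names are chosen here for illustration.

```python
# Inputs from the example organization (dollar amounts in millions).
chance_of_disaster = 0.05      # a. annual chance of disaster
downtime_cost_per_hour = 3     # b.
data_loss_cost_per_hour = 5    # c.
rto_hours = 96                 # recovery time objective at the remote site
rpo_hours = 12                 # recovery point objective onsite

downtime_cost = rto_hours * downtime_cost_per_hour      # d. RTO x b
data_loss_cost = rpo_hours * data_loss_cost_per_hour    # e. RPO x c
loss = downtime_cost + data_loss_cost                   # d + e
expected_loss = loss * chance_of_disaster               # f. (d + e) x a
maximum_loss = 2 * loss                                 # g. (d + e) x 2

print(f"d. cost of downtime:  ${downtime_cost} million")
print(f"e. cost of data loss: ${data_loss_cost} million")
print(f"f. expected loss:     ${expected_loss:.1f} million")
print(f"g. maximum loss:      ${maximum_loss} million")
```

Keeping the inputs as named variables makes it easy to rerun the calculation with different RTO and RPO values.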

3. Data Analyzed

- RTO and RPO targets can then be adjusted to ensure that the maximum sustainable loss is greater than the maximum loss.

4. Business case constructed:

- Analyst constructs business case / cost-benefit analysis detailing alternative scenarios

- Recommend the best alternative to the business

5. Initial Design and business case accepted by sponsor and any other stakeholders necessary
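The adjustment of RTO and RPO targets described under milestone 3 can be sketched as a simple search over candidate values. The hourly costs and the 2x multiple are the figures from the example. The candidate RTO and RPO values and the $100 million maximum sustainable loss are hypothetical, since the article does not state them.

```python
from itertools import product

DOWNTIME_COST_PER_HOUR = 3    # $ millions, from the example
DATA_LOSS_COST_PER_HOUR = 5   # $ millions, from the example
MAX_LOSS_MULTIPLE = 2         # maximum loss = 2 x loss, as in the example


def maximum_loss(rto_hours, rpo_hours):
    """Maximum loss in $ millions for a given RTO and RPO."""
    loss = (rto_hours * DOWNTIME_COST_PER_HOUR
            + rpo_hours * DATA_LOSS_COST_PER_HOUR)
    return MAX_LOSS_MULTIPLE * loss


def acceptable_targets(rto_options, rpo_options, max_sustainable_loss):
    """Return (rto, rpo, max_loss) tuples whose maximum loss stays below
    the maximum sustainable loss."""
    results = []
    for rto, rpo in product(rto_options, rpo_options):
        worst = maximum_loss(rto, rpo)
        if worst < max_sustainable_loss:
            results.append((rto, rpo, worst))
    return results


# Hypothetical candidates (hours) and a hypothetical $100 million ceiling.
candidates = acceptable_targets(rto_options=[96, 24, 8, 4],
                                rpo_options=[12, 4, 1],
                                max_sustainable_loss=100)
for rto, rpo, worst in candidates:
    print(f"RTO {rto}h, RPO {rpo}h -> maximum loss ${worst} million")
```

Each surviving pair would then be costed in the business case, since tighter RTO and RPO targets generally require more investment.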

Design phase

Acceptance test considerations

The design phase is complete once the stakeholders agree that the design plan will meet the goals and requirements of the capability, the RFP has been issued and the key hardware, software and telecom vendors are selected.

Key design milestones

This phase should take about 10-14 weeks and about 60 person days of effort.

1. Primary vendor decided

- Decide on the vendor hardware and software technologies available and issue RFP/solicit bids

- EMC, HDS, HP, & IBM would be the primary vendors to consider

- Determine telecommunication requirements and issue RFP/solicit bids

2. Disaster recovery procedures designed

- Design procedures around the selected hardware and software, and integrate them with current procedures

- Pay particular attention to the criteria for failing over to the remote site

- Determine training requirements for operations

- Design test procedures and scripts

Deploy phase

Acceptance Test Considerations

The deploy phase will be completed when:

- The backup and recovery capability is built, tested, and brought into service

- A fail-over and fail-back test is run within 12 months of cutover

- The operations group is fully responsible for all aspects of the installation

Key deployment milestones

This phase should take about 4-6 months and cost between $1.5 million & $8.0 million.

1. The backup and recovery capability is built

- Installation of storage hardware and storage management functionality

- Installation of the telecommunication facilities

- Update of existing processes and procedures and creation of new ones, with full documentation

2. The backup and recovery capability tested

- Testing of equipment, software, and procedures on historical data

- Testing of recovery on some non-mission critical live applications

- Testing of procedures for migration, backup, recovery, and disaster recovery

3. Migration & Cutover completed

- Phased migration cutover

- Extensive monitoring of performance, reliability, & telecommunication performance

4. The backup and recovery capability initiative wrapped up

- Procedures set up for monitoring performance and implementing yearly testing (including yearly fail-over and fail-back testing)

- Procedures set up for adding additional storage, storage functionality, and telecommunications bandwidth

- Final review of documentation

- All project staff released and full hand-over to storage operations

Initiative summary

As noted, a failproof backup and recovery capability will vary depending on the nature of the organization’s data and how much risk and financial loss the organization can tolerate. In any case, how the capability is implemented and managed is crucial to the health of the business. Management needs to make it a top priority and a critical management objective. This means treating it as part of the overall system, to be consistently tested, examined, and updated in the same way as other business operations.
