Establishing a disaster recovery strategy is not for the faint-hearted. Disasters happen or don’t happen. When they don’t, you have spent too much. When they do, you have never spent enough. This article discusses the main technologies that are available, the criteria against which they should be measured, and the methodology for deciding the correct strategy. This can be used to agree the ideal strategy for different application groups, and as input to establishing compliance with Sarbanes-Oxley and other compliance initiatives.
Disaster recovery strategies for storage capability
There are six main disaster recovery technology options:
- Consolidated tape or disk media backup is the least expensive solution and has the greatest permanent data loss and the slowest recovery times. It is well suited for addressing limited disruptions, such as data corruption. Improved techniques such as disk-to-disk backups and virtual tape can significantly improve efficiency and reduce the time to recovery.
- High-availability storage networks can overcome local server failures by providing access to a standby or clustered server system to ensure continuous operation. Permanent data loss can be low but not zero. However, the distance between the data centers is very short, which increases the probability that a single disaster will take out both sites. For this reason it is not regarded as a viable disaster recovery topology for most organizations.
- Remote point-in-time update replication copies the changes made to data to another building or city. Changes can be replicated at scheduled times during the day or whenever changes occur. This technology accommodates any distance requirements, as there are no latency limitations to overcome. It offers faster recovery times than tape backup, but it cannot achieve zero permanent data loss. Data recovery is measured in hours.
- Asynchronous replication has significantly lower data recovery times than point-in-time update replication. Asynchronous replication allows the primary and remote copies to be out of synchronization by a range of seconds to minutes. Permanent data loss is low but not zero.
- Synchronous disk replication is suitable for applications that require the fastest recovery with zero permanent data loss. All disk writes are synchronously copied to a remote site across a high-performance network before a transaction is acknowledged, eliminating any transaction loss. This technology is sensitive to network latency, which limits the practical distance between sites to typically less than 50 miles.
- Three-node topologies combine technologies to achieve a very high probability of zero data loss at long distances. They pair synchronous replication (to a local recovery node) with asynchronous replication (to a remote recovery node). The local recovery node can accommodate very rapid recovery with a high probability of zero permanent data loss. The remote recovery node provides for recovery with low permanent data loss in the unlikely event that both the primary and local recovery nodes are impacted. The 3 node disaster recovery article gives more information.
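The distance limit on synchronous replication follows from propagation delay: every write waits for a round trip to the remote site. The sketch below estimates that minimum delay. It assumes light travels through optical fiber at about 200,000 km/s and ignores switching, protocol, and storage-array overheads, so real figures will be higher. The function name is illustrative only.

```python
# Rough lower bound on the latency that distance adds to each synchronous write.
# Assumption: light in optical fiber travels at about 200,000 km/s.
# Switching, protocol and array overheads are ignored.

KM_PER_MILE = 1.609344
FIBER_KM_PER_SECOND = 200_000.0  # assumed propagation speed in fiber


def round_trip_ms(distance_miles: float) -> float:
    """Minimum round-trip propagation delay in milliseconds."""
    distance_km = distance_miles * KM_PER_MILE
    one_way_seconds = distance_km / FIBER_KM_PER_SECOND
    return 2 * one_way_seconds * 1000.0


if __name__ == "__main__":
    for miles in (10, 50, 500):
        delay = round_trip_ms(miles)
        print(f"{miles:>4} miles: at least {delay:.2f} ms added per write")
```

Under these assumptions, 50 miles adds a little under a millisecond to every acknowledged write, while 500 miles adds roughly ten times that, which is why longer distances are usually served by asynchronous replication.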
There are five main criteria for evaluating these options:
- Distance (greater distance reduces the probability of both sites being hit)
- Probability of permanent data loss
- Amount of permanent data loss
- Recovery time
- Cost
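One way to make the comparison repeatable is to record each option against the criteria and filter by an application group's requirements. The sketch below is a minimal illustration, not a definitive model. It covers only three of the five criteria. The zero-loss and distance flags follow the descriptions above where they are stated; the long-distance flags for tape and asynchronous replication, and every cost rank other than tape being cheapest, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Option:
    name: str
    zero_loss_possible: bool  # can target zero permanent data loss
    long_distance: bool       # usable between widely separated sites
    cost_rank: int            # 1 = cheapest; ordering is an assumption


# Flags follow the article where stated; the rest are illustrative assumptions.
OPTIONS = [
    Option("Tape or disk media backup", False, True, 1),
    Option("High-availability storage network", False, False, 2),
    Option("Point-in-time update replication", False, True, 3),
    Option("Asynchronous replication", False, True, 4),
    Option("Synchronous disk replication", True, False, 5),
    # Three-node offers a high probability of zero loss, not a guarantee.
    Option("Three-node topology", True, True, 6),
]


def shortlist(options: List[Option], need_zero_loss: bool,
              need_long_distance: bool) -> List[Option]:
    """Return options meeting the requirements, cheapest first."""
    matches = [
        o for o in options
        if (o.zero_loss_possible or not need_zero_loss)
        and (o.long_distance or not need_long_distance)
    ]
    return sorted(matches, key=lambda o: o.cost_rank)


if __name__ == "__main__":
    for option in shortlist(OPTIONS, need_zero_loss=True, need_long_distance=False):
        print(option.name)
```

A real evaluation would replace the ranks with measured recovery times, data loss figures, and quoted costs for each application group.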
Specific operational goals of Disaster recovery strategy for storage
The likely investment required to establish a 3 node disaster recovery topology is between $50,000 and $100,000, with an elapsed time of 2 to 4 months. [Note: These figures assume a standard Wikibon business model organization with $1B in revenue, 4,000 employees, and an IT budget of $40M per year.]
A successful establishment of disaster recovery strategy will:
- Show executive management, internal auditors and external auditors that all strategies have been considered for the major application groups
- Establish a methodology for deciding between the strategies that can be documented and agreed
- Provide a key input into the compliance process
- Ensure that the correct level of overall spend on disaster recovery is established, and can be easily reviewed
Risks in a disaster recovery strategy initiative
The major risks of a disaster recovery strategy initiative are:
- The risks to the business are evaluated incorrectly
- Applications are included in the highest-level disaster recovery strategies for organizational expediency
- Key supporting applications (not seen by the users) are not considered together with the major applications
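The last risk can be reduced by deriving recovery groups from recorded dependencies rather than from the user-visible applications alone. The sketch below treats applications as nodes in a graph and places everything that is connected, directly or indirectly, in the same group. The application names are hypothetical, and grouping by connected components is one possible approach rather than a prescribed method.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def recovery_groups(dependencies: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group applications that depend on each other, directly or indirectly."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for app, needs in dependencies:
        graph[app].add(needs)
        graph[needs].add(app)  # a dependency links both applications

    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        group: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


if __name__ == "__main__":
    # Hypothetical dependency pairs: (application, application it relies on).
    deps = [
        ("order entry", "authentication"),
        ("order entry", "billing"),
        ("payroll", "hr database"),
    ]
    for group in recovery_groups(deps):
        print(sorted(group))
```

A widely shared service, such as authentication, would pull many applications into a single group under this approach, so in practice such services may need to be handled separately and protected at the highest level required by any dependent group.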
The Disaster recovery strategy for storage initiative
This initiative will be completed when the correct disaster recovery technology decisions have been made for all applications within the organization and agreed to by IT, line of business executives, and senior management. In addition, those decisions, including the underlying assumptions, will have been documented in a way that allows direct use by the groups responsible for compliance.
Expectations (out-of-scope)
The following factors, although necessary for a disaster recovery strategy initiative to be successful, are not in scope:
- The requirements of all compliance regulations have been studied and documented
- A risk management structure is in place and a key decision maker has been identified
- Roles within the IT organization and coordination processes with other business groups are defined and understood
Analyze Phase
Acceptance Test Considerations
The analyze phase will be completed when the recommendations have been accepted by the sponsor, and agreement has been reached on the correct disaster recovery strategy for each application group by IT, heads of lines of business, and executive management.
Key analysis milestones
Analysis should take 6-14 weeks for most organizations.
- An effective sponsor of the initiative is identified
- It is important that the sponsor can resolve any organizational issues, and has a familiarity with risk metrics and methodologies
- Data collected
- Determine the major application groups that need to be considered together for disaster recovery purposes
- Determine the key disaster recovery parameters for the current disaster recovery system (see diagram for example) for the application groups
- Agree the correct disaster recovery strategy for each application group
- Determine the costs of the different topologies, including equipment and software costs, additional data center costs, telecommunication costs, and implementation costs
- Determine the optimum location of the disaster recovery site(s), balancing the higher telecommunication and recovery costs of greater distance against the lower business risk of losing all sites
- Business case constructed:
- Analyst constructs business case / cost benefit analysis detail
- If necessary, construct business case of alternative scenarios
- Recommend the best alternative to the business
- Recommendations and business cases accepted by sponsor and any other stakeholders necessary
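The cost benefit comparison can be sketched as an expected annual cost: the annualized cost of a topology plus the probability-weighted loss if a disaster occurs. The example below is a simplified illustration. All figures, field names, and the two sample topologies are hypothetical, and the model assumes a single annual disaster probability that does not vary with the distance between sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    name: str
    annual_cost: float      # equipment, software, data center, telecom, implementation (annualized)
    recovery_hours: float   # expected time to recover after a disaster
    data_loss_hours: float  # hours of work permanently lost


def expected_annual_cost(topology: Topology,
                         disaster_probability: float,
                         downtime_cost_per_hour: float,
                         data_loss_cost_per_hour: float) -> float:
    """Annual topology cost plus the probability-weighted loss from a disaster."""
    loss_if_disaster = (
        topology.recovery_hours * downtime_cost_per_hour
        + topology.data_loss_hours * data_loss_cost_per_hour
    )
    return topology.annual_cost + disaster_probability * loss_if_disaster


if __name__ == "__main__":
    # Hypothetical figures for one application group.
    candidates = [
        Topology("Tape backup", 100_000, recovery_hours=48, data_loss_hours=24),
        Topology("Synchronous replication", 600_000, recovery_hours=1, data_loss_hours=0),
    ]
    for t in candidates:
        cost = expected_annual_cost(
            t,
            disaster_probability=0.02,
            downtime_cost_per_hour=200_000,
            data_loss_cost_per_hour=100_000,
        )
        print(f"{t.name}: expected annual cost ${cost:,.0f}")
```

Running the same calculation for alternative scenarios, for example a higher disaster probability or a higher downtime cost, shows how sensitive the recommendation is to the risk assumptions agreed with the business.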
Documentation Phase
Acceptance Test Considerations
The documentation phase will be completed when the disaster recovery strategy is fully documented and submitted, and is sufficient to initiate the design phase (if required).
The documentation phase should take 2-4 weeks for most organizations.
Design & Deploy Phases
The design and deploy phases (if required), which cover the specific changes recommended to reduce business risk, should be set up, funded, and staffed. How they are tackled will depend on the differences between the current and proposed disaster recovery strategies. Areas with significant business exposure should be expedited. Where possible, the changes should take place at a measured pace, with time allowed between application groups so that the changes to technology, processes, and procedures can settle down.
Initiative summary
Establishing and documenting a comprehensive disaster recovery strategy is essential for all organizations. The strategies deployed for application groups should be reviewed on a regular basis (at least every two years, or every year for fast-changing organizations). The cost of such a strategy review should be between $50,000 and $100,000, and it should take two to four months. Documentation of such a review is very important, both as a basis for initiating the effort to design and deploy the recommendations and as input to the compliance processes.