Disaster recovery in most organizations is based on a two-node topology, with one production site and a remote (>200 miles) backup disaster recovery location. In the event of a disaster, this approach will always result in some permanent data loss. This is true even if remote replication is used, because synchronous replication is impractical over such distances: every write would have to wait for an acknowledgment from the remote site.
Permanent data loss is a growing problem for organizations as the degree of interdependency between systems increases. Previously, organizations had manual procedures and paper trails they could use to recover data. While many of these procedures remain in place, they are inadequate in today’s interconnected world and create only an illusion that they can be relied upon for data recovery.
For many organizations, the business risk associated with this reality is becoming unacceptable, and reducing the probability of permanent data loss is a business imperative.
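The distance limit on synchronous replication mentioned above comes largely from propagation delay, since each write must wait for an acknowledgment from the other site. The following is a rough sketch only. It assumes a signal speed in optical fiber of about 200,000 km/s (roughly 5 microseconds per kilometer one way) and ignores switching, protocol, and storage controller overhead; the function name and the sample distances are illustrative.

```python
# Rough estimate of the latency that synchronous replication adds to each write.
# Assumption: signal speed in optical fiber is ~200,000 km/s (about 5 us per km).
# Real links add switching, protocol, and controller overhead on top of this.

KM_PER_MILE = 1.609344
FIBER_SPEED_KM_PER_S = 200_000.0  # assumed value, about two-thirds of c


def sync_write_penalty_ms(distance_miles: float, round_trips: int = 1) -> float:
    """Minimum added latency per write (ms) from propagation delay alone."""
    distance_km = distance_miles * KM_PER_MILE
    one_way_s = distance_km / FIBER_SPEED_KM_PER_S
    return 2 * one_way_s * round_trips * 1000.0


if __name__ == "__main__":
    for miles in (10, 50, 200, 1000):
        penalty = sync_write_penalty_ms(miles)
        print(f"{miles:>5} miles: at least {penalty:.2f} ms per write")
```

Under these assumptions a 200-mile link adds at least about 3.2 ms to every write before any equipment overhead is counted. The `round_trips` parameter covers the case where a protocol needs more than one round trip per write.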
3-node disaster recovery capability
3-node disaster recovery topologies combine several technologies to give a very high probability of zero data loss at long distances. They pair synchronous replication (to a local recovery node) with asynchronous replication (to a remote recovery node). The local recovery node allows very rapid recovery with a high probability of zero permanent data loss. The remote recovery node provides for recovery with low permanent data loss in the unlikely event that both the primary and local recovery nodes are impacted.
There are two types of 3-node disaster recovery topology: cascade and multi-target. In the cascade topology, the local recovery site has only storage, and the remote recovery site is connected only to this local site. In the multi-target topology, the local recovery site is a full data recovery node (with servers), and the remote recovery site is connected to both the primary and the local sites.
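The difference between the two topologies can be summarized as a set of replication links. The sketch below is illustrative only: the site names, the `Link` type, and the `feeds_for` helper are hypothetical, and treating the local-to-remote link in the multi-target case as asynchronous is an assumption rather than something the text specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    mode: str  # "sync" or "async"


# Cascade: the remote site is fed only through the storage-only local site.
CASCADE = [
    Link("primary", "local", "sync"),
    Link("local", "remote", "async"),
]

# Multi-target: the remote site is connected to both other sites.
# Assumption: the local-to-remote link is asynchronous.
MULTI_TARGET = [
    Link("primary", "local", "sync"),
    Link("primary", "remote", "async"),
    Link("local", "remote", "async"),
]


def feeds_for(topology, site, failed=frozenset()):
    """Return the links that can still feed `site` once the `failed` sites are lost."""
    return [link for link in topology
            if link.target == site and link.source not in failed]


if __name__ == "__main__":
    for name, topology in (("cascade", CASCADE), ("multi-target", MULTI_TARGET)):
        feeds = feeds_for(topology, "remote", failed={"local"})
        print(name, "-> remote feeds after losing local:",
              [(link.source, link.mode) for link in feeds])
```

In this model, losing the local site leaves the cascade remote site with no replication feed, while the multi-target remote site is still fed from the primary.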
The primary advantage of these topologies is that the business impact of permanent data loss can be almost eliminated. This allows the business processes for recovery to be simplified. In addition, disaster recovery testing is simplified, and IT personnel can be shared between the primary node and the local recovery node. Moreover, this approach provides the foundation for much faster system recovery. A fuller definition can be found in the Wikibon entry for 3 node disaster recovery.
Specific operational goals of implementing 3 node disaster recovery
The likely investment required to implement a 3 node disaster recovery topology is between $1.5 million and $8.5 million, with an elapsed time of between 9 and 12 months. [Note: These figures assume a standard Wikibon business model organization with $1B in revenue, 4,000 employees and an IT budget of $40M per year. The scenario assumes that 300 terabytes are installed, with a mixture of high performance SAN, mid-range storage solutions, some direct attached storage, and some NAS]. The amount of data that is remotely replicated is 14 terabytes. Compared to a traditional 2 node topology with recovery from tape, a successful 3 node disaster recovery implementation will:
- Reduce the annual probability of permanent data loss by a factor of up to 10 (e.g. from 7% to 0.7%)
- Reduce the annual probability of a systems outage by a factor of up to 10 (e.g. from 7% to 0.7%)
- Reduce the expected loss from a disaster by a factor of up to 80
- Simplify the business processes, resulting in improved employee productivity equating to 1.5% of revenue, gradually realized over a three year period
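The arithmetic behind these figures can be sketched as an expected annual loss: the probability of an event multiplied by its cost. The 7% and 0.7% probabilities are the examples from the list above. The dollar impacts are hypothetical placeholders, chosen only so that the combined effect matches the 80-fold figure; the text does not say how that reduction splits between lower probability and lower impact per event.

```python
def expected_annual_loss(probability: float, impact: float) -> float:
    """Expected loss per year = annual probability of the event x cost if it occurs."""
    return probability * impact


# Probabilities are the examples from the text; impacts are hypothetical.
P_TWO_NODE = 0.07
P_THREE_NODE = 0.007
IMPACT_TWO_NODE = 80_000_000.0    # hypothetical cost per event, 2 node tape recovery
IMPACT_THREE_NODE = 10_000_000.0  # hypothetical cost per event, 3 node recovery

loss_two = expected_annual_loss(P_TWO_NODE, IMPACT_TWO_NODE)
loss_three = expected_annual_loss(P_THREE_NODE, IMPACT_THREE_NODE)

if __name__ == "__main__":
    print(f"probability reduction: {P_TWO_NODE / P_THREE_NODE:.0f}x")
    print(f"2 node expected annual loss: ${loss_two:,.0f}")
    print(f"3 node expected annual loss: ${loss_three:,.0f}")
    print(f"expected-loss reduction: {loss_two / loss_three:.0f}x")
```

Substituting an organization's own impact estimates shows how sensitive the expected-loss reduction is to the assumed cost per event.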
Notwithstanding the 'insurance policy' nature of the investment, an analysis of the need for 3 node disaster recovery is likely to find a strong business case (ROI and IRR > 400%) for the right types of organization. All the benefits will come from the business (mitigating exposure) and virtually none from IT, due to the large increase in infrastructure spend. [Note: assumes a higher risk business where exposures to permanent data loss are high (e.g. financial services)].
Risks of implementing 3 node disaster recovery systems
3-node data centers have been implemented in many financial organizations. These organizations have often used mainframes, and have very competent and self-sufficient technical staff.
Recent improvements in storage controller design and performance have enabled 3-node data centers to be implemented on all systems across all industries. However, these technologies are not in widespread use and are complex to analyze and deploy, so overall this would be a medium- to high-risk project.
The major risks to a 3 node disaster recovery initiative are:
- Misjudging the business losses that systems outages and permanent data loss would cause can lead to inadequate protection or over-spending.
- Locating the remote recovery node too far away from the primary site exposes the system to significant performance degradation and undue communications costs.
- Increasing the overall complexity of IT infrastructure escalates the need for adequate testing.
3 node disaster recovery initiative
The 3 node disaster recovery strategy is considered implemented when the 3 node infrastructure has been designed, built, tested, deployed and successfully handed over to operations so that it can run as specified without external support. In addition, a fail-over and fail-back test should be run within 12 months of cutover.
Expectations (out-of-scope)
The following factors are outside the scope of the 3 node disaster recovery initiative but are very important to it. If these factors are not in place or addressed, the probability of a successful 3 node disaster recovery outcome will be significantly lower:
- An acceptance of business risk metrics and methodologies by senior executives within the organization.
- A business risk manager who has the authority to evaluate and recommend/authorize expenditure to reduce risk to the business.
- The business can fund a significant increase in IT capital and operational budgets.
- The expertise and project management skills are available to analyze, design, test and deploy a complex disaster recovery project.
Analyze phase
Acceptance test considerations
The analyze phase will be completed when the initial business case has been accepted by the sponsor, and agreement has been reached to proceed to the design phase or kill the project.
Key analysis milestones
This phase should take about 6-12 weeks and 30-60 person days of effort.
- An effective sponsor of the initiative is identified
- It is important that the sponsor can resolve any organizational issues, and has a familiarity with risk metrics and methodologies
- It is important that the interested parties agree on a process for establishing and agreeing the potential risks to the organization from data loss. The groups should include the corporate risk manager, the business managers responsible for the business processes and IT systems, audit managers, governance managers (S-Ox, SEC, etc.), and legal.
- The Business Impact Analysis (BIA) should include multiple methods for establishing the impact of a disaster, and a method for "triangulating" to a result that is "close enough" and has the confidence and support of the stakeholders. Different methods of estimating the BIA include:
- Bottom-up risk analysis (e.g., the risks, and the probability of those risks)
- What insurance premiums would be necessary to mitigate risk
- What reserves should be, or are required to be, held in case of a disaster (e.g., Basel II requirements)
- Wall Street estimates of the short and long-term impact of a disaster on the capitalization of the company
- Data collected
- Determine the key disaster recovery parameters for the current disaster recovery system (see diagram for example) for the key application groups
- Agree which application groups to consider for 3-node disaster recovery (usually mission critical applications)
- Determine the key disaster recovery parameters for the alternative disaster recovery topologies (e.g., 2-node, 3-node cascade, 3-node multi-target)
- Determine the costs of the different topologies, including equipment and software costs, additional data center costs, telecommunication costs, and implementation costs
- Determine the optimum location of the 3 nodes, balancing increased telecommunication and recovery costs with decreased business risks of greater distance
- Business case constructed:
- Analyst constructs business case / cost benefit analysis detail of alternative scenarios
- Recommend the best alternative to the business
- Initial design and business case accepted by the sponsor and any other necessary stakeholders
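One simple way to "triangulate" the BIA estimates described earlier in this phase is to set the figures from each method side by side and report a central value together with the spread. This is a minimal sketch, not a prescribed methodology: the method names mirror the list of estimation methods above, the `triangulate` helper is hypothetical, and every dollar figure is a placeholder.

```python
from statistics import median

# Hypothetical annual-exposure estimates (USD) from each BIA method.
estimates = {
    "bottom_up_risk_analysis": 4_500_000,
    "insurance_premium_equivalent": 6_000_000,
    "regulatory_reserves": 5_200_000,
    "market_capitalization_impact": 9_000_000,
}


def triangulate(values):
    """Return (central estimate, highest/lowest ratio) for independent estimates."""
    ordered = sorted(values)
    central = median(ordered)
    spread = ordered[-1] / ordered[0]  # how far apart the methods are
    return central, spread


central, spread = triangulate(estimates.values())

if __name__ == "__main__":
    print(f"central estimate: ${central:,.0f}")
    print(f"highest/lowest ratio: {spread:.1f}")
```

A large ratio between the highest and lowest estimates suggests that the stakeholders should revisit the underlying assumptions before agreeing on a figure.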
Design phase
Acceptance test considerations
The design phase will be completed when the design has been accepted by the sponsor and agreed to by the key stakeholders, the RFP has been issued and the key hardware, software and telecommunication vendors selected, funding for the project has been agreed, and agreement has been reached to proceed to the deploy phase or kill the initiative.
Key design milestones
This phase should take about 10-14 weeks and about 60 person days of effort.
- Primary vendor decided
- Decide on vendor hardware and software technologies available and issue RFP/solicit bids
- EMC, HDS, HP, & IBM would be the primary vendors to consider
- Determine telecommunication requirements and issue RFP/solicit bids
- The 3-node technology selected will have a major impact on the telecommunications costs, and the two components should be decided together
- Disaster recovery procedures designed
- Design procedures around the selected hardware and software, and integrate them with current procedures
- Pay particular attention to the criteria for failing over to the remote disaster recovery node
- Determine training requirements for operations
- Design test procedures and scripts
Deploy phase
Acceptance test considerations
The deploy phase will be completed when:
- The 3 node disaster recovery is built, tested, and brought into service
- A fail-over and fail-back test is run within 12 months of cutover
- The operations group is fully responsible for all aspects of the installation
Key deployment milestones
This phase should take about 4-6 months and cost between $1.5 million and $8.0 million.
- 3 node disaster recovery topology built
- Installation of storage hardware and storage management functionality
- Installation of the telecommunication facilities
- Update and creation of new process and procedures, with full documentation
- 3 node disaster recovery tested
- Testing of equipment, software, and procedures on historical data
- Testing of recovery on some non-mission critical live applications
- Testing of procedures for migration, backup, recovery, and disaster recovery
- Migration & Cut-over to 3 node disaster recovery completed
- Phased migration cut-over to 3 node disaster recovery
- Extensive monitoring of performance, reliability, & telecommunication performance
- 3 node disaster recovery initiative wrapped up
- Procedures set up for monitoring performance and implementing yearly testing (including yearly fail-over and fail-back testing)
- Procedures set up for adding additional storage, storage functionality, and telecommunications bandwidth
- Final review of documentation
- All project staff released and full hand-over to storage operations
3 node disaster recovery initiative summary
3 node disaster recovery is a complex technology, and the analysis of its benefits is imprecise. Moreover, the metrics used are unfamiliar to many business executives, and explaining the concept and implications of zero data loss is not trivial. However, this should not be a reason to avoid evaluating a 3 node strategy and identifying the risks that face an organization. Many organizations will determine that increasing the probability of zero data loss is a sound business strategy, and will build it into the design of their applications, business processes, and infrastructure.