Originating Author: Josh Krischer
The most important phase in disaster recovery is planning. An enterprise planning for business continuity must perform a business impact analysis (BIA). The planning requirements should come from the business units, based on their needs. Each business unit should calculate the losses it would incur as the result of a disaster and the cost of recreating the lost data. This is the most critical step: it identifies what and how much the enterprise has at risk, as well as which business processes are most critical, thereby prioritizing risk management and recovery investment. The business continuity team (which should include the business process owners) must translate the business requirements into an overall business continuity plan that covers the technology, people, and business processes needed for recovery. Two of the most important considerations, illustrated in the sketch after the list below, are:
- Recovery time objective (RTO): how quickly systems and business processes must be restored.
- Recovery point objective (RPO): the point in time to which data must be restored.
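As a rough illustration, the short Python sketch below shows how a given RTO/RPO pair translates into financial exposure. The revenue rate, transaction volume, and per-transaction recreation cost are purely hypothetical placeholders; real values must come from the business units' own loss calculations.

```python
# Rough BIA exposure estimate for a given RTO/RPO pair.
# All figures are hypothetical placeholders; real values come from
# the business units' own loss calculations.

def downtime_exposure(rto_hours: float, revenue_per_hour: float) -> float:
    """Revenue at risk while the business process is unavailable."""
    return rto_hours * revenue_per_hour

def data_loss_exposure(rpo_minutes: float, tx_per_minute: float,
                       cost_per_lost_tx: float) -> float:
    """Cost of recreating (or losing) the transactions written since
    the last consistent remote copy."""
    return rpo_minutes * tx_per_minute * cost_per_lost_tx

if __name__ == "__main__":
    # Hypothetical business-unit inputs: 4 h RTO, 15 min RPO
    print(f"Downtime exposure:  ${downtime_exposure(4, 250_000):,.0f}")
    print(f"Data-loss exposure: ${data_loss_exposure(15, 1_200, 35):,.0f}")
```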
A risk impact analysis has to take into account the impact of a risk, were it to become a reality, as well as the probability of that particular situation unfolding. Various strategies for lessening the impact of the event are then considered; typically, these include no action at all, insurance policies, or specific mechanisms that mitigate potential losses. These considerations determine the distances, technologies, and methods used to support the disaster recovery plan. The most important factor influencing the time required for recovery is data consistency and integrity at the recovery site, not, as is commonly believed, the possibility of losing a few transactions.
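As an illustration of the weighing described above, the following sketch scores each risk by its expected annual loss (probability times impact) and compares the mitigation options by residual expected loss plus option cost. All risks, probabilities, and dollar figures are invented for the example.

```python
# Sketch of a risk impact analysis: expected annual loss per risk is
# probability x impact, and each mitigation option is judged by
# (residual expected loss + option cost). All figures are invented.

RISKS = {
    # risk: (annual probability, impact in $ if it occurs)
    "site power outage":   (0.20, 2_000_000),
    "regional earthquake": (0.01, 50_000_000),
}

MITIGATIONS = {
    # option: (annual cost in $, fraction of impact still suffered)
    "no action": (0, 1.00),
    "insurance": (150_000, 0.30),
    "redundant infrastructure": (600_000, 0.05),
}

for risk, (prob, impact) in RISKS.items():
    print(risk)
    for option, (cost, residual) in MITIGATIONS.items():
        total = cost + prob * impact * residual
        print(f"  {option:>25}: expected annual cost ${total:,.0f}")
```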
Distance
One of the most deeply rooted myths is that a longer distance ensures better disaster protection. In reality, the distance is dictated by potential risks, regulations, management decisions, and the location of existing organizational assets. There is no ideal distance between primary and secondary (disaster recovery) data centers. It is true that increasing the distance between data centers reduces the likelihood that both are affected by the same disaster. However, few disasters happen on a large scale, and increased distance raises the risk of broken links and line failures and may make it difficult or even impossible for employees to travel to the recovery site. A larger distance between the primary and secondary site also means higher telecommunication costs and limits the choice of suitable remote copy techniques; it may further reduce performance and increase the chance of disruption. In practice, most global companies already own sites at extended distances, so the freedom of choice for a secondary site is limited by economic considerations.
The most effective approach to finding the optimal distance is to conduct a risk impact analysis study. This study should cover common outages such as power, water, network, and telecommunications failures; geophysical disasters such as earthquakes or tornadoes; geopolitical situations such as riots, terrorist attacks, or strikes; potential loss of life; and personnel transportation issues. The optimal location is the one that minimizes these risks at an acceptable cost while meeting the required SLAs and regulatory obligations. Companies may elect to invest in infrastructure to ensure the availability of resources that are usually beyond their control. In most cases, regardless of the distance between the sites, each data center should have a separate main and/or emergency power supply and separate telecommunications paths. Regardless of which data transfer technology is used, a redundant option should be provided by using two separate routes.
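The site-selection logic above can be expressed as a simple filter-then-minimize exercise: discard candidates that fail hard constraints (regulations, achievable RTO), then pick the cheapest acceptable mix of residual risk and cost. The sketch below is hypothetical; the candidate sites, costs, risk scores, and the weighting between risk and cost are invented, and in practice that weighting is a management decision.

```python
# Hypothetical site-selection sketch: filter candidate secondary sites
# by hard constraints (regulations, RTO feasibility), then pick the one
# with the lowest combined risk score and annual cost. Data is invented.

from dataclasses import dataclass

@dataclass
class Site:
    name: str
    distance_km: float
    annual_cost: float        # facilities + telecommunications
    risk_score: float         # aggregated from the risk impact analysis
    meets_regulations: bool
    supports_rto: bool        # staff can fail over within the agreed RTO

candidates = [
    Site("Campus B (same metro)",  25, 400_000, 60, True, True),
    Site("Regional site",         180, 550_000, 35, True, True),
    Site("Distant site",         1500, 900_000, 20, True, False),
]

feasible = [s for s in candidates if s.meets_regulations and s.supports_rto]
# Weighting risk against cost is itself a management decision.
best = min(feasible, key=lambda s: s.risk_score * 10_000 + s.annual_cost)
print("Selected secondary site:", best.name)
```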
Despite IBM's lab demonstrations of synchronous remote copy (Metro Mirror) over distances of up to 300 km (using DWDM), practical use is limited by costs and performance penalties, which keep the typical practical distance below 40 km. Asynchronous techniques are designed to maintain a copy at much greater distances (up to 8,000 km).
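The performance penalty behind these distance limits is easy to estimate. Light in optical fiber propagates at roughly 5 microseconds per kilometer, and a synchronous write cannot complete before at least one round trip to the remote site. The sketch below computes that lower bound; real protocols may need several round trips plus equipment latency, so actual penalties are higher.

```python
# Back-of-the-envelope write-latency penalty of synchronous remote copy.
# Assumes ~5 microseconds per km one-way in optical fiber and a single
# round trip per write; treat the result as a lower bound.

SPEED_US_PER_KM = 5  # approximate one-way propagation delay in fiber

def sync_write_penalty_ms(distance_km: float, round_trips: int = 1) -> float:
    """Added latency per write, in milliseconds."""
    return 2 * distance_km * SPEED_US_PER_KM * round_trips / 1000

for km in (10, 40, 100, 300):
    print(f"{km:>4} km: +{sync_write_penalty_ms(km):.1f} ms per write")
```

At 40 km the penalty is a modest 0.4 ms per write, but at 300 km it reaches 3 ms per round trip, which for write-intensive workloads explains why the practical synchronous range stays well below the demonstrated maximum.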
One important planning factor, with a large budget impact, is the required bandwidth between the sites. For synchronous remote copy, the bandwidth should exceed peak data transfer requirements; for asynchronous remote copy, bandwidth sized for average activity is sufficient. In many disaster recovery infrastructures, the cost of data transmission exceeds the hardware expenditure. A sound compromise between RPO requirements and remote copy bandwidth can therefore lower data transfer costs significantly: asynchronous remote copy allows lower bandwidth to be provisioned, albeit at the cost of potentially higher data loss in case of a disaster.
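The sizing rule above can be sketched as follows. Using an invented 24-hour write-rate profile, the example derives the peak (what a synchronous link must carry) and the average (what an asynchronous link can target), then estimates the worst-case replication lag, and hence potential data loss, for a candidate asynchronous link. All traffic figures are hypothetical.

```python
# Sketch of link sizing for remote copy, under simplified assumptions:
# a synchronous link must carry the peak write rate; an asynchronous
# link can be sized near the average as long as the backlog built up
# during bursts stays within the RPO. Traffic figures are invented.

# Hypothetical write rates in MB/s, one value per hour of a business day
hourly_write_mb_s = [20, 15, 10, 10, 25, 60, 120, 180, 200,
                     190, 170, 150, 140, 160, 180, 150, 120,
                     90, 60, 40, 30, 25, 20, 20]

peak = max(hourly_write_mb_s)
avg = sum(hourly_write_mb_s) / len(hourly_write_mb_s)
print(f"Synchronous link must cover peak:  {peak} MB/s")
print(f"Asynchronous link can target avg:  {avg:.0f} MB/s (plus headroom)")

def max_lag_seconds(link_mb_s: float) -> float:
    """Worst-case replication lag for a given async link capacity."""
    backlog_mb, worst_lag = 0.0, 0.0
    for rate in hourly_write_mb_s:
        # Backlog grows when writes exceed the link, drains otherwise.
        backlog_mb = max(0.0, backlog_mb + (rate - link_mb_s) * 3600)
        worst_lag = max(worst_lag, backlog_mb / link_mb_s)
    return worst_lag

print(f"Worst-case lag at 120 MB/s: {max_lag_seconds(120):.0f} s")
```

If the worst-case lag exceeds the negotiated RPO, either the link must be upgraded or the RPO renegotiated with the business units; that trade-off is exactly the compromise described above.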
Synchronous remote copy is commonly implemented over FC, ESCON, or FICON fiber links, or over IP with iSCSI. Asynchronous techniques can also employ fiber links but usually use IP or telecommunication links such as OC-3 or OC-12.
Action Items:
- Know the enterprise cost of downtime. Perform a business impact analysis in the early design stage.
- Negotiate the required RTO and RPO with the business units.
- Perform a risk analysis.
- Use professional services to compensate for any lack of in-house skills.