Storage Peer Incite: Notes from Wikibon’s May 4, 2010 Research Meeting
More than a decade after "disaster recovery" became "business continuity planning", perhaps the most common error that IT department planners make, said the experts at Tuesday's Peer Incite meeting on zero-data-loss DR, is still to focus too closely on the technology. Effective business continuity planning must start with business decisions that cannot be made by IT and then must balance personnel issues with technology in implementing those business decisions.
The first major question to face is how much data loss and downtime are acceptable in a disaster. IT certainly will have opinions, but these are business decisions that must be made by business leaders based on the options and relative expenses IT presents, because those business leaders have the responsibility to judge the relative damage that a data loss will cause. Their first instinct will always be to choose zero data loss. The two practitioners, identified only by first names, who spoke at the meeting said that the much lower cost of new zero data loss architecture pioneered by Axxana often makes that the right target. The IT business continuity plan must also be designed to work with the business's recovery plan, which in many organizations is quite detailed.
Another common error is to presume that personnel involved in a disaster will be available and able to relocate. It does little good to have systems restored with zero data loss in a half hour at some remote site if the personnel needed to run those systems are either unable or unwilling to relocate with them. Personnel are people. They have families, including often extended families, who will come first in a disaster. They may also be injured or otherwise incapacitated themselves. And even if they are available and willing to relocate, travel may be impossible. Major exit routes may be blocked or snarled in traffic jams, airplanes will probably be grounded, and in North America in particular trains may not provide the required connections.
Finally, the technology itself may hold hidden issues. For instance, the very creative Axxana solution for zero data loss depends on cellular data transfer to deliver those final transactions captured in its armored buffer to the recovery site. But cellular is not always available in a disaster. Experience shows that cellular systems can be swamped with personal calls from people caught up in the emergency, and cell sites are vulnerable to the same disaster that has shut down the data center. When Hurricane Katrina hit New Orleans, for instance, one of the two cellular systems serving the city went down almost immediately. The other survived for several hours, with some sites remaining in operation throughout the recovery, but the network was pushed to its limits as people tried to use it to locate loved ones. This is not a criticism of the Axxana system, but it does mean that Axxana users need to talk to their cellular system provider to ensure that the local cell site has emergency power and is hardened to survive whatever disasters are likely in that location. They also need to get a priority number for their emergency data transfer. This priority system has been implemented since Katrina mainly to ensure that emergency services calls are not blocked by cellular network overloads, but numbers are usually available for legitimate emergency needs of private organizations.
Successful disaster recovery, therefore, requires forethought, careful planning, and coordination between the business, which must define the need, and IT, which must implement the IT portion of business recovery.
G. Berton Latamore
Achieving Zero Data Loss Disaster Recovery Using Asynchronous Infrastructure
Disaster recovery planning involves a complex series of trade-offs related to RPO, RTO, distance, geography, business value, latency, application inter-dependencies and threat levels. Disaster recovery solution architects also have to contend with numerous organizational, technology integration, and asset management challenges. When the goal is zero data loss, the challenges are even more pressing because infrastructure becomes increasingly complex to meet business requirements.
Post 9/11, the financial services industry in particular began investigating and in many cases implementing 3-Node Disaster Recovery solutions in a star or multi-hop topology. Such approaches rely on two data centers at synchronous distance and a third, more remote data center at asynch distance, often in Europe, that lowers the risks associated with localized disasters taking out the two synchronous sites. This topology, while expensive and cumbersome to deploy, was the only reliable way to ensure zero or near zero data loss.
Solutions are just becoming available to better address these challenges using asynchronous technologies, and we are seeing the potential to mitigate risks with less complex and possibly even more effective infrastructure.
At the May 4th, 2010 Wikibon Peer Incite we were joined by two practitioners from the financial services industry, Hylton and Steve, who were identified only by first name on the call. These individuals are senior-level IT practitioners with extensive experience in IT management, disaster recovery, and business continuity.
Also joining the session was Dr. Alex Winokour, CTO of Axxana. He is an expert in the field of data management, data protection and storage. Winokour spent 11 years with IBM in research, where he achieved the title of Master Inventor. He has authored or co-authored more than 15 patents and was the CTO of XIV, founder of Sepaton, and a co-founder of Axxana.
Winokour described his invention called Phoenix, which invokes an airplane black box metaphor. Phoenix is a hardened and persistent storage system that is used to synchronously replicate data from a main site and acts as a buffer to a data center situated at asynchronous distance. If the main site is lost due to a disaster, the 'delta' data (i.e. the data that has not been updated to the asynchronous site) can be extracted from Phoenix using common cellular networks and brought up to synch with the remote site.
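The data flow described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not Axxana's implementation: the class names (HardenedBuffer, MainSite, RemoteSite), the sequence-number acknowledgement scheme, and the in-memory "transfer" that stands in for the cellular link are all hypothetical.

```python
from collections import OrderedDict


class HardenedBuffer:
    """Hypothetical stand-in for the hardened 'black box' at the main site."""

    def __init__(self):
        # seq -> payload for writes the remote site has not yet acknowledged
        self._pending = OrderedDict()

    def record(self, seq, payload):
        self._pending[seq] = payload

    def acknowledge(self, seq):
        # The remote site confirmed everything up to and including seq.
        for s in [s for s in self._pending if s <= seq]:
            del self._pending[s]

    def delta(self):
        """Writes that never reached the remote site, in order."""
        return list(self._pending.items())


class RemoteSite:
    """Hypothetical data center at asynchronous distance."""

    def __init__(self):
        self.log = []

    def apply(self, seq, payload):
        self.log.append((seq, payload))
        return seq  # acknowledgement


class MainSite:
    def __init__(self, buffer, remote):
        self.buffer = buffer
        self.remote = remote
        self.seq = 0
        self.in_flight = []

    def write(self, payload):
        self.seq += 1
        self.buffer.record(self.seq, payload)       # synchronous, local step
        self.in_flight.append((self.seq, payload))  # queued for async shipment
        return self.seq

    def ship(self, count):
        """Asynchronously ship up to `count` queued writes to the remote site."""
        for _ in range(min(count, len(self.in_flight))):
            seq, payload = self.in_flight.pop(0)
            self.buffer.acknowledge(self.remote.apply(seq, payload))


def recover_after_disaster(buffer, remote):
    """Replay the buffered delta onto the remote site (the cellular step)."""
    for seq, payload in buffer.delta():
        remote.apply(seq, payload)
    return [payload for _, payload in remote.log]
```

If five transactions are written and only three are shipped before the main site is lost, the buffer still holds the last two, and replaying them brings the remote log up to all five.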
What Problems Does this Solve?
According to the practitioners on the call, this type of technology has the potential to simplify zero data loss by using asynchronous infrastructure, which can be placed at a safer distance with lower communications costs and fewer data center resources than a 3-node DR approach.
By providing a guaranteed point-in-time solution using two data centers instead of three, organizations can decrease costs and simplify disaster recovery operations. Essentially, the practitioners see this as a way to achieve better RPO than synchronous (because there is a remote backup outside of synch distance) with the cost structure of an asynchronous data center infrastructure.
As well, given the interdependency of applications in the portfolio today, the cascading effects of a disaster can be enormous. Most organizations cannot justify a three data center infrastructure to support all applications, and this increases risk across the portfolio. A zero data loss solution that uses asynchronous infrastructure dramatically opens up the range of applications that can cost-effectively achieve zero data loss.
Key Advice to Peers
Both practitioners stressed the need to start with the business requirement and ensure that business operations drive IT decision-making. In many organizations, the business believes technology alone can solve DR problems, but in reality a DR solution must directly weave the business edicts throughout the planning, response, and recovery aspects associated with a disaster.
Critical to DR planning is an understanding of the RPO and RTO requirements of the business. This will provide a better understanding of the business exposure and help IT work with finance to set a reasonable budget for disaster recovery. The tighter the RPO and RTO requirements, the greater the complexity and expense of existing solutions. The goal should be to reduce complexity at the point of recovery which is what an asynchronous infrastructure can enable.
The two practitioners offered the following additional advice:
Hylton - Think through the execution of the recovery plan and understand its execution. Once you pull the trigger, the ripple effects will be fast and dramatic.
Steve - Simplicity is the absolute key to success. Complexity at the point of recovery is very dangerous. Make sure the plan is practical from a human perspective - get the people side right.
Starting Points
Wikibon member and data center consultant Josh Krischer of Josh Krischer Associates contributed to the call and wrote a research note pertaining to it, Planning for Remote Mirroring, in which he provides practical steps for practitioners in planning DR. In summary, the Wikibon members on the call concur with Krischer's following recommendations:
- Perform a business impact analysis (BIA),
- Perform a risk impact analysis,
- Set RPO and RTO with the business lines,
- Understand your network recovery objectives.
Action item: Disaster recovery planning must involve business input from the start to drive requirements and ultimately determine what the appropriate IT solution is. The greater the complexity of recovery, the greater the risk. CIOs should endeavor to weave business requirements throughout DR planning, simplify infrastructure especially at the point of recovery, and architect zero data loss solutions that can practically support the business from a human capital perspective.
Planning Multiple Data Centers to achieve Business Resilience
Consolidation has been a major trend in organizations globally. Operationally it is more cost effective to run a single data center than multiple data centers. Over-consolidation, however, brings other risks.
With the advent of technologies such as Axxana, extremely high business resilience and zero data loss can be achieved with two data centers at asynchronous distances. Before Axxana, achieving this same level of resilience required three data centers.
Just because this replication topology can be achieved across the Atlantic or across the continent does not mean that it should be, however. Most studies show that a 300 mile separation between data centers provides as much protection against disaster as 3,000 miles, while making it much easier for critical staff, and possibly their families, to reach the secondary location both in testing and particularly in a real disaster. CIOs and disaster planners should ask themselves a simple question: “If there is a flood, earthquake, or hurricane at my primary location, would I leave family and friends and move across the continent for my company? Could I ask my staff to do that?”
Action Item: Second data centers should ideally be set up to allow easy traveling in the case of a disaster. The second data center should be big enough to accommodate traveling staff after a disaster. The locations should also be sited to provide maximum travel options -- ideally highway and train as well as air -- to maximize the chance that travel to the secondary center is possible in a disaster. Balancing load more evenly across two (or more) data centers is likely to give higher levels of practical resilience.
Planning for Remote Mirroring
The most-important phase in disaster recovery is planning. An enterprise planning for business continuity must perform a business impact analysis (BIA). The requirements for planning should be given by business units based on their needs. The business unit should calculate the losses incurred as the result of a disaster and in recreating the lost data. This is the most-critical step; it identifies what and how much the enterprise has at risk, as well as which business processes are most critical, thereby prioritizing risk management and recovery investment. The business continuity team (which should include the business process owners) must translate the business requirements into an overall business continuity plan that includes the technology, people, and business processes for recovery. Two of the most-important considerations are:
- Recovery time requirements (RTO): how quickly operations must be restored.
- Requirements for data restoration (RPO): to which point in time the data must be restored.
Risk impact analysis has to take into account the impact of a risk, were it to become a reality, as well as the probability of that particular situation unfolding. Various strategies for lessening the impact of the event are then considered. Typically, these can include no action at all, insurance policies, or specific mechanisms that mitigate potential losses. These considerations determine the distances, technologies, and methods used to support the disaster recovery plan. The most important factor that influences the time required for recovery is the data consistency and integrity in the recovery site, not, as is commonly believed, the possibility of losing a few transactions.
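One simple way to combine impact and probability is an expected annual loss, which can then be compared across the strategies mentioned above (no action, insurance, mitigation). The sketch below assumes that approach; the text does not prescribe a formula, and the Risk and Strategy types, the "impact_reduction" fraction, and any figures fed into them are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    annual_probability: float  # chance of occurring in a given year, 0..1
    impact: float              # estimated loss if it occurs, in currency units

    def expected_annual_loss(self):
        return self.annual_probability * self.impact


@dataclass
class Strategy:
    name: str
    annual_cost: float       # what the strategy costs per year
    impact_reduction: float  # assumed fraction of the loss it avoids, 0..1


def net_annual_cost(risk, strategy):
    """Strategy cost plus the expected loss that remains after applying it."""
    residual = risk.expected_annual_loss() * (1.0 - strategy.impact_reduction)
    return strategy.annual_cost + residual


def best_strategy(risk, strategies):
    """Pick the strategy with the lowest net annual cost for this risk."""
    return min(strategies, key=lambda s: net_annual_cost(risk, s))
```

For example, a risk with a 2% annual probability and an impact of 5,000,000 has an expected annual loss of about 100,000. A strategy costing 60,000 a year that avoids 95% of the loss then nets out lower than doing nothing. Real analyses would also weigh regulation, reputation, and loss of life, which a single number does not capture.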
Distance
One of the most deeply rooted myths is that longer distance ensures better disaster protection. In reality, the distance is dictated by potential risks, regulations, management decision, and the location of existing organizational assets. There is no ideal distance between primary and secondary (disaster recovery) data centers. It is true that increasing the distance between data centers reduces the likelihood that the two centers are affected by the same disaster. However, few disasters happen on a large scale, and increased distance between data centers increases the risk of broken links and line failures and may make it difficult or even impossible for employees to travel to the recovery site. A larger distance between the primary and secondary site means higher telecommunication costs and limits the choice of appropriate remote copy technique selection. It may also reduce performance and increase the chances of disruption. However, most global companies already own sites at extended distances, so the freedom of choice for their secondary site is limited by economic considerations.
The most effective approach to finding the optimal distance is to conduct a risk impact analysis study. This study should include mitigating risks from common outages like power, water, network, and telecommunications; geophysical disasters such as earthquakes or tornadoes; geopolitical situations like riots, terrorist attacks or strikes; a potential loss of people's lives, and personnel transportation issues. The optimal location is the one that minimizes the risks at an acceptable cost and meets the required SLAs and authorities' regulations. Companies may elect to invest in infrastructure to ensure availability of resources that are usually beyond their control. In most cases, regardless of the distance between the sites, each data center should have a separate main and/or emergency power supply and separate telecommunications paths. Independently of which data transfer technology is used, a redundant option should be provided by using two separate routes.
Despite IBM's lab demonstrations of synchronous remote copy (Metro Mirroring) over distances of up to 300 km (using DWDM), the practical use is limited by costs and performance penalties, which reduce the average practical distance to a range below 40 km. Asynchronous techniques are designed to maintain a copy at much greater distances (up to 8000 km).
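Part of the performance penalty is simple propagation delay, which a short calculation makes concrete. The sketch below assumes light travels through optical fiber at roughly 200,000 km/s and that each synchronous write waits for at least one round trip; it ignores switching, protocol, and storage overheads, and the function names and the local service time parameter are hypothetical.

```python
FIBER_KM_PER_SECOND = 200_000.0  # approximate speed of light in optical fiber


def sync_write_penalty_ms(distance_km, round_trips=1):
    """Added latency per synchronous write from propagation delay alone."""
    one_way_seconds = distance_km / FIBER_KM_PER_SECOND
    return 2 * one_way_seconds * round_trips * 1000.0


def max_serial_writes_per_second(local_service_ms, distance_km, round_trips=1):
    """Upper bound for a single stream of strictly sequential writes."""
    total_ms = local_service_ms + sync_write_penalty_ms(distance_km, round_trips)
    return 1000.0 / total_ms
```

Under these assumptions, 40 km adds about 0.4 ms per write, 300 km about 3 ms, and 8000 km about 80 ms, which is why the longest distances are left to asynchronous techniques.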
One important factor in planning, which has a large budget impact, is the required bandwidth between the sites. The bandwidth in synchronous remote copy should exceed peak data transfer requirements. For asynchronous remote copy, bandwidth sized for average activity is sufficient. In many disaster recovery infrastructures, the costs of data transmission exceed hardware expenditures. A sound compromise between RPO requirements and remote copy bandwidth may lower data transfer costs significantly. Asynchronous remote copy allows lower bandwidth to be provisioned, but at the cost of potentially higher data loss in case of a disaster.
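The trade-off between bandwidth and data at risk can be sketched with a simple backlog model. This is a minimal illustration, assuming write rates sampled at fixed intervals in MB/s and a link of constant speed; the function names and sample figures are hypothetical.

```python
def size_links(write_rates_mb_s):
    """Return (peak, average) write rate: rough sizing floors for
    synchronous and asynchronous remote copy respectively."""
    peak = max(write_rates_mb_s)
    average = sum(write_rates_mb_s) / len(write_rates_mb_s)
    return peak, average


def max_backlog_mb(write_rates_mb_s, link_mb_s, interval_s=1.0):
    """Largest amount of written-but-unsent data over the sample period.

    This backlog is the data that would be lost if the primary site failed
    at the worst moment, so it is a rough proxy for RPO exposure.
    """
    backlog = worst = 0.0
    for rate in write_rates_mb_s:
        backlog = max(0.0, backlog + (rate - link_mb_s) * interval_s)
        worst = max(worst, backlog)
    return worst
```

With sample rates of 10, 10, 50, 90, 20, and 0 MB/s, the peak is 90 MB/s and the average is 30 MB/s. A 30 MB/s link lets the backlog grow to 80 MB during the burst, while a 90 MB/s link never falls behind.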
Synchronous remote copy is commonly implemented over FC, ESCON, or FICON over fiber links or IP and iSCSI. Asynchronous techniques can employ fiber links but usually use IP or telecommunication links such as OC3 or OC12.
Action item:
- Know the enterprise cost of downtime. Perform business impact analysis in the early design stage.
- Negotiate the required RTO and RPO with the business units.
- Perform risk analysis.
- Use professional services to compensate for a lack of in-house skills.
The Human Factor in Disaster Recovery and Business Continuity Planning
Forward-thinking organizations will recognize that consideration of the human-impact factor is fundamental to business continuity and disaster recovery planning. A highly-available IT infrastructure, having both systems and people that reside entirely within the potential impact zone of a single disaster, will not provide a company with the business resiliency necessary to maintain operations through a disaster.
Companies should ideally maintain the geographic separation of multiple data centers at a sufficient distance to ensure that at least one will survive any man-made event, such as an act of terrorism, or a natural disaster, such as an earthquake, a volcanic eruption, a hurricane, or flood. Companies should not only provide redundant systems but should also replicate data between the multiple sites. And while it has become increasingly popular to develop lights-out, secondary data centers, companies should give careful consideration to having multiple, fully-staffed operations centers. While a company may be able to recover applications to a remote, replicated site, if only a limited number of knowledgeable employees are stationed at the recovery location, and the workers within the disaster zone are managing their own, more personal disasters, the business may not be able to recover from an operational perspective. DR plans must allow for the real-life personal choices between business needs and personal and family needs that individual employees will make during a disaster and should expect that personal safety and family will come first. Therefore, disaster recovery and business continuity capabilities should not be "key-man" dependent. In addition, when the transportation system is disrupted, as happened to air travel within, to and from Western Europe during the recent volcanic eruption in Iceland, even if individuals are willing to go to another location to work, the residual effects of the disaster may prevent it.
As organizations take greater advantage of lower-cost human capital in developing regions in the deployment of data centers, they should also consider the quality of the physical infrastructure for both businesses and individuals. An earthquake in San Francisco, where building codes are well-established and rigorously enforced, will have substantially less human impact than an earthquake of similar magnitude in many developing countries. Even within developed countries, the impact of similar terrorist attacks may vary widely. In New York City, where building height and population density are both high, the impact of a 9/11-style terrorist attack will be much greater than in London, where building heights and population density are more modest.
Action item: CIOs should work with business and organizational executives to ensure that disaster recovery and business continuity plans consider the impact on both infrastructure and people. The recovery of IT systems is not the same as recovery of business operations. Regional differences in terms of disaster survivability are significant; employees will often put self and family before business, and disruptions in transportation systems can prevent employees from leaving the disaster zone, even when they are willing to do so. When possible, organizations should provide for distributed recovery scenarios with well-distributed workers, so that no single location and no single pool of human capital presents a single point of failure.
Recovery Solutions are the Name of the Game in BCDR
Sell Solutions
Solutions selling is the name of the game for the business continuity/DR marketplace today. This is the single most important and clear message from practitioners to the vendor community coming from the May 4, 2010 Wikibon Peer Incite discussion on zero data loss strategies. On the call, the Wikibon community was joined by two practitioners from global financial firms holding senior level IT positions with extensive experience in IT management, disaster recovery and business continuity. It has been more than 20 years since the industry began the transition from a focus on data center recovery to business continuity planning and understanding the role technology plays in enabling the recovery of normal business operations in the face of operational failures of many types. However, according to these experts, the vendor community has a ways to go to better understand recoverability requirements, costs, and benefits unique to each business they sell to.
Be Reasonable and Practical
This core message about recovery solutions was supported with other guidance from the practitioner community:
- Sell solutions on a more practical and reasonable basis. Soft costs and artificial savings are difficult to project. Understand the business and quantify data and service loss in terms of human efforts, business opportunity cost, loss in productivity, and reputation.
- Sell solutions that:
  - Guarantee functionality and benefits,
  - Demonstrate how the solution will work in the customer's business and technology environment, and
  - Convince customers that complexities can be overcome.
- If the solution is overly complex, support it with services.
- Sell a solution (product, services, support), not a fragmented product.
- Demonstrate that you understand the problem and show examples of peer solutions (industry peers, please).
Data Recording and Information Dispersal
Example of Two Protection and Recovery Strategies
Practitioners must consider several trade-offs when deciding on data protection and recovery solutions. For example, on the call we raised the point about the differences between “electronic data recording (EDR)” and “information dispersal algorithm (IDA)” technologies as solutions for data protection and recovery. EDR uses a hardened “cache”, or persistent storage, to synchronously replicate data from a main site. This is a buffer to data centers situated at asynchronous distances from each other. If data at the primary site is lost, the 'delta' data in the EDR can be rolled forward through alternate networks and synchronized to the secondary or recovery site.
IDA, on the other hand, is a completely different solution strategy and architecture. IDAs split data into multiple pieces and allow it to be recovered from some threshold subset of those pieces. In a typical implementation, a data center might create an IDA environment that is 10-of-15. This requires that any 10 of the 15 data streams, nodes, or storage locations be accessible to continue processing or to recover data in the case of a disaster or data loss event. However, if more than 5 of the 15 nodes fail or are affected by a disaster, fewer than 10 pieces remain and the data cannot be recovered. On the flip side, if fewer than 10 nodes are compromised through a side channel, hacker, or malware, the data obtained will have no value to the attacker.
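The 10-of-15 threshold can be illustrated with a minimal sketch. It is not any vendor's implementation: it assumes a textbook polynomial-interpolation scheme over the prime field of size 257, the function names are hypothetical, and it illustrates only the recovery threshold, making no claim about the confidentiality of a production system. It needs Python 3.8 or later for the modular inverse via pow.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate_at(points, x):
    """Evaluate at x the unique polynomial through points [(xi, yi)], mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def disperse(data, k, n):
    """Split data into n pieces so that any k of them can rebuild it."""
    if not 0 < k <= n or n + k >= P:
        raise ValueError("need 0 < k <= n and n + k < 257")
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    data_xs = range(n + 1, n + k + 1)      # data sits at x = n+1 .. n+k
    pieces = {x: [] for x in range(1, n + 1)}
    for start in range(0, len(padded), k):
        points = list(zip(data_xs, padded[start:start + k]))
        for x in pieces:                   # piece x stores the value at x
            pieces[x].append(_interpolate_at(points, x))
    return pieces


def recover(pieces, k, n, length):
    """Rebuild the original bytes from any k surviving pieces."""
    if len(pieces) < k:
        raise ValueError("fewer than k pieces survive; data is unrecoverable")
    chosen = sorted(pieces.items())[:k]
    out = bytearray()
    for b in range(len(chosen[0][1])):
        points = [(x, values[b]) for x, values in chosen]
        out.extend(_interpolate_at(points, x) for x in range(n + 1, n + k + 1))
    return bytes(out[:length])
```

With k=10 and n=15, dropping any five pieces still allows recover to return the original bytes, while supplying only nine pieces raises an error because the threshold is not met.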
Both technologies can be configured to provide zero data loss solutions to an enterprise, but the vendor AND end-user must be able to determine the trade-offs in terms of cost, value, complexity, interoperability, recoverability, maturity, and a host of other factors.
Action item: The vendor community must engage practitioners in recovery solutions selling. This means clearly demonstrating knowledge of business requirements and data protection options. At the same time, the BCDR practitioner must stay connected to innovations in data protection and recovery technologies and be able to engage in the debate of the pros, cons, and business trade-offs offered by the BCDR solutions marketplace.
4 Myths in BCDR
Business Continuity and Disaster Recovery (BCDR) describe an organization's preparation for unforeseen risks to continued operations. The trend of combining business continuity and disaster recovery into a single term has resulted from a growing recognition that business and technology executives share responsibilities for assessing risks, establishing control procedures and systems, etc., rather than developing plans in isolation.
As the BCDR discipline matures, practitioners must dispel some of the folklore of the past to move forward, become more resistant to business disruptions, recover more quickly and with more integrity, and control costs. Four of these BCDR myths were discussed by banking practitioners on the May 4, 2010 Wikibon Peer Incite call on zero data loss strategies:
- Longer distances between primary and backup business and technology operations ensure better business continuity. Not so. Distance creates risk and must be dictated by recoverability requirements, regulations, management decision, and the location of business assets prior to and during recovery efforts, including people, information, partners, customers, and of course, information systems.
- More replication, more backup, equals more data protection. Wrong. Replication creates risks and security exposures. Replication requirements must be expertly synchronized with failover and recovery requirements. All replication/backup created that cannot be tied to a recovery requirement (failure situation, RPO, RTO, business recovery objective), should be canceled/deleted.
- Business continuity is a technology issue. Really? BCDR is a business discipline enabled by technology. Technology creates business risk and the need for the BCDR discipline (e.g., system, data center, network failures), while at the same time enabling the discipline with capability and functionality to recover in the event of a loss (e.g., backup and recovery systems).
- IT and business users have the same interests in BCDR. Sorry. Simply said, IT is more focused on backup, but the business pays for recovery/resiliency. CTOs employ data backup and recovery professionals, while business executives hire risk managers to ensure critical business functions are available in the case of a disruption in IT operations. These interests of course are linked, but different.
Action item: BCDR is a discipline, not a project. Know what you're paying for. Clearly communicate business recovery objectives (BROs) to IT, understand IT's technical and operational capabilities, and participate in the BCDR tests and demonstrations. From the IT side, make sure the business understands the risks technology presents to their business, and architect systems that can demonstrate RTO/RPO compliance on an ongoing basis.