Disaster recovery planning involves a complex series of trade-offs related to RPO, RTO, distance, geography, business value, latency, application inter-dependencies and threat levels. Disaster recovery solution architects also have to contend with numerous organizational, technology integration, and asset management challenges. When the goal is zero data loss, the challenges are even more pressing because infrastructure becomes increasingly complex to meet business requirements.
Post 9/11, the financial services industry in particular began investigating, and in many cases implementing, 3-node disaster recovery solutions in a star or multi-hop topology. Such approaches rely on two data centers at synchronous distance and a third, more remote data center at asynchronous distance, often in Europe, which lowers the risk of a localized disaster taking out both synchronous sites. This topology, while expensive and cumbersome to deploy, was the only reliable way to ensure zero or near-zero data loss.
Solutions are just becoming available to better address these challenges using asynchronous technologies, and we are seeing the potential to mitigate risks with less complex and possibly even more effective infrastructure.
At the May 4th, 2010 Wikibon Peer Incite we were joined by two practitioners from the financial services industry, Hylton and Steve, who remained anonymous on the call. Both are senior-level IT practitioners with extensive experience in IT management, disaster recovery, and business continuity.
Also joining the session was Dr. Alex Winokour, CTO of Axxana. He is an expert in the field of data management, data protection and storage. Winokour spent 11 years with IBM in research, where he achieved the title of Master Inventor. He has authored or co-authored more than 15 patents and was the CTO of XIV, founder of Sepaton, and a co-founder of Axxana.
Winokour described his invention, called Phoenix, which invokes an airplane black box metaphor. Phoenix is a hardened, persistent storage system located at the main site. Data from the main site is replicated to it synchronously, so it acts as a buffer for a data center situated at asynchronous distance. If the main site is lost in a disaster, the 'delta' data (i.e. the data that has not yet been replicated to the asynchronous site) can be extracted from Phoenix over common cellular networks and used to bring the remote site back into sync.
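The black box idea described above can be illustrated with a small simulation. This is only a minimal sketch of the general pattern (synchronous write to a local hardened buffer, asynchronous shipment to the remote site, replay of the unshipped delta after a disaster). It is not Axxana's implementation, and the class and function names (`BlackBox`, `RemoteSite`, `PrimarySite`, `recover`) are hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class BlackBox:
    """Hypothetical stand-in for the hardened local buffer."""

    def __init__(self):
        # seq -> payload for writes the remote site has not yet confirmed
        self._pending = OrderedDict()

    def store(self, seq, payload):
        self._pending[seq] = payload

    def acknowledge(self, seq):
        # The remote site confirmed every write up to and including seq.
        for s in [s for s in self._pending if s <= seq]:
            del self._pending[s]

    def delta(self):
        # Writes that exist locally but never reached the remote site.
        return list(self._pending.items())


class RemoteSite:
    """Hypothetical data center at asynchronous distance."""

    def __init__(self):
        self.log = []

    def apply(self, seq, payload):
        self.log.append((seq, payload))


class PrimarySite:
    def __init__(self, black_box, remote):
        self.black_box = black_box
        self.remote = remote
        self._seq = 0
        self._in_flight = []  # writes queued for asynchronous shipment

    def write(self, payload):
        self._seq += 1
        # Synchronous step: the write is in the black box before we return.
        self.black_box.store(self._seq, payload)
        # Asynchronous step: shipment to the remote site happens later.
        self._in_flight.append((self._seq, payload))
        return self._seq

    def drain(self, max_writes):
        """Simulate the async link delivering up to max_writes queued writes."""
        for _ in range(min(max_writes, len(self._in_flight))):
            seq, payload = self._in_flight.pop(0)
            self.remote.apply(seq, payload)
            self.black_box.acknowledge(seq)


def recover(black_box, remote):
    """After the primary is lost, replay the delta onto the remote site."""
    for seq, payload in black_box.delta():
        remote.apply(seq, payload)
    return remote.log
```

In this sketch, if four writes are made and only two have been shipped when the primary site is lost, `black_box.delta()` holds the remaining two, and `recover` brings the remote log up to all four. The transport used to extract the delta (cellular networks in the description above) is deliberately left out.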
What Problems Does this Solve?
According to the practitioners on the call, this type of technology has the potential to simplify zero data loss by using asynchronous infrastructure, which can be placed at a safer distance with lower communications costs and fewer data center resources than a 3-node DR approach requires.
By providing a guaranteed point-in-time solution using two data centers instead of three, organizations can decrease costs and simplify disaster recovery operations. Essentially, the practitioners see this as a way to achieve better RPO than synchronous replication alone (because there is a remote backup outside of synchronous distance) with the cost structure of an asynchronous data center infrastructure.
In addition, given the interdependency of applications in today's portfolios, the cascading effects of a disaster can be enormous. Most organizations cannot justify a three data center infrastructure to support all applications, and this increases risk across the portfolio. A zero data loss solution that uses asynchronous infrastructure dramatically widens the range of applications that can cost-effectively achieve zero data loss.
Key Advice to Peers
Both practitioners stressed the need to start with the business requirement and ensure that business operations drive IT decision-making. In many organizations, the business believes technology alone can solve DR problems, but in reality a DR solution must directly weave the business edicts throughout the planning, response, and recovery aspects associated with a disaster.
Critical to DR planning is an understanding of the RPO and RTO requirements of the business. Knowing these requirements gives a clearer picture of the business exposure and helps IT work with finance to set a reasonable budget for disaster recovery. The tighter the RPO and RTO requirements, the greater the complexity and expense of existing solutions. The goal should be to reduce complexity at the point of recovery, which is what an asynchronous infrastructure can enable.
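As a rough illustration of how RPO and RTO requirements can be compared against what a given design delivers, consider the sketch below. It assumes RPO is expressed as the maximum tolerable window of lost data and RTO as the maximum tolerable time to restore service, both in seconds. The names `Objectives` and `meets_objectives` and all numbers are hypothetical and do not come from the call.

```python
from dataclasses import dataclass


@dataclass
class Objectives:
    rpo_seconds: float  # maximum tolerable window of lost data
    rto_seconds: float  # maximum tolerable time to restore service


def meets_objectives(obj, worst_case_unprotected_seconds, estimated_recovery_seconds):
    """Compare a design's worst-case figures against the business objectives.

    worst_case_unprotected_seconds: the longest window of writes that could be
    lost if the primary site fails (0 if every write is protected synchronously).
    estimated_recovery_seconds: the estimated time to bring service back.
    """
    return {
        "rpo_met": worst_case_unprotected_seconds <= obj.rpo_seconds,
        "rto_met": estimated_recovery_seconds <= obj.rto_seconds,
    }


# Hypothetical example: a zero data loss requirement with a four-hour RTO.
zero_loss = Objectives(rpo_seconds=0, rto_seconds=4 * 3600)

# Plain asynchronous replication with up to 30 seconds of unshipped writes.
plain_async = meets_objectives(zero_loss, 30, 2 * 3600)

# Asynchronous replication whose unshipped writes are also held in a
# synchronously written local buffer, so no window of writes is unprotected.
buffered_async = meets_objectives(zero_loss, 0, 2 * 3600)
```

Under these assumed figures, the plain asynchronous design misses the zero RPO while the buffered design meets it. In practice, the inputs would come from the business impact analysis rather than from estimates like these.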
The two practitioners offered the following additional advice:
Hylton - Think through the execution of the recovery plan and understand its consequences. Once you pull the trigger, the ripple effects will be fast and dramatic.
Steve - Simplicity is the absolute key to success. Complexity at the point of recovery is very dangerous. Make sure the plan is practical from a human perspective - get the people side right.
Starting Points
Wikibon member and data center consultant Josh Krischer of Josh Krischer Associates contributed to the call and wrote a related research note, Planning for Remote Mirroring, in which he provides practical steps for practitioners planning DR. In summary, the Wikibon members on the call concur with the following recommendations from Krischer:
- Perform a business impact analysis (BIA),
- Perform a risk impact analysis,
- Set RPO and RTO with the business lines,
- Understand your network recovery objectives.
Action Item: Disaster recovery planning must involve business input from the start to drive requirements and ultimately determine what the appropriate IT solution is. The greater the complexity of recovery, the greater the risk. CIOs should endeavor to weave business requirements throughout DR planning, simplify infrastructure, especially at the point of recovery, and architect zero data loss solutions that can practically support the business from a human capital perspective.