Storage Peer Incite: Notes from Wikibon’s February 26, 2007 Research Meeting
New techniques are emerging that allow organizations to fail over to, and back from, a third data center. This enables DR testing without risking the test itself becoming a disastrous event.
Storage ménage à trois for disaster recovery testing
In certain core infrastructure industries (e.g., finance, insurance, telecommunications, health care, and transport) and the industries that support them, regulatory regimes dictate that organizations must reliably demonstrate the capability to fail over and fail back at distances of hundreds of miles. In practice, demonstrating and testing this capability is becoming increasingly difficult and is requiring new approaches to disaster recovery.
For example, certain regulations demand that organizations keep active backups of data hundreds of miles away from the primary source so that a metropolitan disaster cannot take out all copies of the data. Under these circumstances, users need mechanisms that not only ensure data can be recovered but also allow them to verify, through reasonable testing, that an appropriate architecture is in place to mitigate disasters.
What we see large organizations doing is setting up data replication across three data centers. Two of the data centers (A and B) are synchronized, so that data written to the main center is simultaneously written to the backup site. If there is a problem with A, failover to B is nearly instantaneous and, just as important, failback once A comes back online can be handled smoothly and seamlessly.
However, because synchronized storage nodes must be close enough in proximity to mitigate speed-of-light latency, sites A and B are usually within the same metropolitan area, which violates the rules governing data center placement. As a consequence, many organizations are introducing a third site (C) that typically trails site B by a small lag, usually on the order of five to thirty minutes, while sitting far enough outside the metropolitan area (usually over 200 miles) to comply with regulations. This ensures that, in a rolling disaster that takes out site A and threatens site B, site C holds a complete copy of the data within, say, five minutes. Because no data is lost, failover and failback can be tested completely, so testing of the organization's entire data estate is reliable and can be accomplished with confidence.
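To make the topology concrete, the following is a minimal sketch (in Python, with hypothetical site and function names, not any vendor's actual replication engine) of the data flow described above: writes are acknowledged only once both A and B hold them, while C trails B by a configurable lag that bounds its worst-case exposure.

```python
# Minimal sketch (hypothetical names) of the A/B/C data flow described above:
# writes commit to A and B together, while C trails B by a configurable lag.

from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    journal: list = field(default_factory=list)  # committed writes, in order

def write(a: Site, b: Site, payload: str, timestamp: float) -> None:
    """Synchronous A->B: the write is not acknowledged until both sites hold it."""
    a.journal.append((timestamp, payload))
    b.journal.append((timestamp, payload))   # same commit, same instant

def replicate_to_c(b: Site, c: Site, now: float, lag_seconds: float) -> None:
    """Asynchronous B->C: ship every write that is older than the lag window."""
    cutoff = now - lag_seconds
    shipped = len(c.journal)
    for entry in b.journal[shipped:]:
        if entry[0] <= cutoff:
            c.journal.append(entry)

# Example: a 5-minute (300-second) lag means C is at most 5 minutes behind B.
a, b, c = Site("A"), Site("B"), Site("C")
write(a, b, "txn-1", timestamp=0)
write(a, b, "txn-2", timestamp=200)
replicate_to_c(b, c, now=400, lag_seconds=300)
print([p for _, p in c.journal])   # ['txn-1']; txn-2 is still inside the lag window
```

The lag parameter caps how far behind C can fall: the smaller the lag, the less data is at risk if a rolling disaster takes out both A and B.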
Not all organizations need to consider this architecture today. However, over the next few years, technologies for constructing a three-node data center solution will become more widely available, more standardized, and relatively inexpensive. Users should be evaluating vendors in part on their product roadmaps for introducing this capability.
The other key user issue is that the lag between the paired synchronous sites (A and B) and the C site dictates the complexity of backup and restore. Longer lags lead to longer restore times, so tying the replication lag to the operational procedures for handling a potential disaster is an absolute necessity.
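As a rough illustration, the lag translates directly into data that must be recovered or re-applied after a rolling disaster. The write and restore rates below are assumed for the sake of the arithmetic, not measured figures.

```python
# Back-of-envelope sketch (assumed figures) of how the B-to-C lag translates
# into data that must be recovered or re-applied after a rolling disaster.

def exposed_data_mb(lag_minutes: float, write_rate_mb_per_s: float) -> float:
    """Worst-case data written at A/B that has not yet reached C."""
    return lag_minutes * 60 * write_rate_mb_per_s

def extra_restore_minutes(exposed_mb: float, restore_rate_mb_per_s: float) -> float:
    """Additional time to recover that exposed data from logs or backups."""
    return exposed_mb / restore_rate_mb_per_s / 60

for lag in (5, 30):
    exposed = exposed_data_mb(lag, write_rate_mb_per_s=10)           # assumed 10 MB/s of writes
    extra = extra_restore_minutes(exposed, restore_rate_mb_per_s=5)  # assumed 5 MB/s restore
    print(f"{lag:>2}-minute lag -> ~{exposed:,.0f} MB exposed, ~{extra:.0f} extra minutes to recover")
```

Under these assumptions, a 30-minute lag exposes six times as much data, and requires six times the recovery effort, as a 5-minute lag.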
The action item for users is that core industries must move in this direction today. Non-core industries should watch how these core industries progress, not only from a technology perspective but also in terms of the practices they use to deliver high-quality, provable three-node disaster recovery capabilities.
The action item for vendors is to find ways to introduce these capabilities into their solutions as quickly as possible.
User actions for disaster recovery testing
In the 2/26/07 Wikibon Storage Research meeting (see Storage ménage à trois for disaster recovery testing) we concluded that certain core infrastructure industries need to consider architecting a three-node data center to improve testing and resiliency.
Less regulated industries should monitor the progress of these leaders to understand:
- The maturity of technologies used
- The processes used to support high quality three-node solutions
- The practices put in place to test and demonstrate the viability of these systems to corporate boards
There are two additional considerations:
- Line-of-business buy-in and key metrics must be analyzed before system architectures can be completed; IT departments should not go it alone.
- Users should begin to require that third-data-center roadmaps be included in vendor RFPs so they can assess supplier direction and viability in this critical area.
Action item: The business impact analysis (BIA) should carefully assess lag times for the C site and the relationship between those lag times and recovery. The longer the lag, the greater the amount of data that will need to be restored and the more manual intervention required.
Organizational implications for three-node disaster recovery
In the 2/26/07 Wikibon Storage Research meeting (see Storage ménage à trois for disaster recovery testing) we concluded that certain core infrastructure industries and their suppliers need to consider architecting a three-node data center to improve testing and resiliency. We also concluded that this technology would become more widely used over the next five years as these technologies and services become higher-volume, lower-cost solutions.
We see a number of organizational implications to address so that, if and when necessary, organizations can adopt such technologies with minimum effort. Organizations should ensure that they have robust procedures for keeping their Business Impact Analysis (BIA) up to date and signed off by senior management. In many organizations this task will be assigned to the CSO (Chief Security Officer). Organizations also need to ensure that there is a robust working relationship between the CSO and the CTO (Chief Technical Officer). The CTO needs a clear mission to ensure that business continuance architectures can accommodate three-node disaster recovery topologies over the long term, and there should be a mid- to long-term plan for remediating systems and applications that do not comply with architectural standards.
The physical positioning of data centers and/or the location of third-party disaster recovery services is crucial. Once established, these sites are difficult, expensive, and time-consuming to change.
Action item: Location decisions should ensure that three-node disaster recovery topologies can be set up easily when necessary. For example, it is better to have two data centers in one city and the "C" site in another than to spread the three sites across three separate cities.
Technology implications for three-node disaster recovery
In the 2/26/07 Wikibon Storage Research meeting (see Storage ménage à trois for disaster recovery testing) we concluded that certain core infrastructure industries and their suppliers need to consider architecting a three-node data center to improve testing and resiliency. We also concluded that there will be significant internal technology constraints on the adoption of this approach by organizations.
To position themselves for such technologies in the future, organizations should ensure that all key applications and their supporting applications are identified, and that the constraints on adopting three-node topologies are understood and documented. They must also ensure that new systems do not introduce constraints on three-node DR and, where possible and appropriate, that existing key systems are modified to remove those constraints.
Key action item: Ensure that there is a robust relationship between the business impact analysis (BIA) outputs and the DR architectures intended to minimize those impacts. In particular, ensure that the "C" site is not so far away that it extends recovery times and compromises the recovery objectives.
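For context, a back-of-envelope sketch of propagation delay (assuming light travels through fiber at roughly 200,000 km per second and ignoring equipment and routing overhead) shows why the A-B pair must stay close while the C site can sit much farther out, and why extreme distances begin to weigh on replication and recovery traffic.

```python
# Rough sketch of why site distance matters: propagation delay alone (assuming
# fiber at ~2/3 the speed of light, ignoring switching and routing overhead)
# adds round-trip latency to every acknowledged replication operation.

FIBER_KM_PER_MS = 200.0   # ~200,000 km/s => roughly 200 km per millisecond, one way

def round_trip_ms(distance_miles: float) -> float:
    distance_km = distance_miles * 1.609
    return 2 * distance_km / FIBER_KM_PER_MS

for miles in (10, 50, 200, 1000):
    print(f"{miles:>5} miles -> ~{round_trip_ms(miles):.1f} ms round trip")
```

Sub-millisecond round trips keep synchronous A-B writes practical within a metropolitan area; at several hundred to a thousand miles, the added milliseconds per operation argue for asynchronous replication to C and factor into how quickly data can be drained back during recovery.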
Vendor actions for three-node data center solutions
In the 2/26/07 Wikibon Storage Research meeting (see Storage ménage à trois for disaster recovery testing) we concluded that certain core infrastructure industries need to consider architecting a three-node data center to improve testing and resiliency.
As a result, we believe vendors should aggressively refine and publish three-node data center roadmaps and technology directions. Successful vendors will architect solutions that reduce transmission costs, since transmission is the main gating factor on lag times at the C site; this includes 'pull' versus 'push' technologies for replication and disaster tolerance. Vendors that are in a position to address these directions within RFPs will gain competitive advantage.
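As a generic illustration of the distinction (not any specific vendor's implementation), push replication ships each change as it occurs, so WAN usage follows write bursts, whereas pull replication lets the C site request batches on its own schedule, which is easier to throttle and to price.

```python
# Generic push-versus-pull sketch (hypothetical names, no vendor API implied).
import queue

change_log: "queue.Queue[str]" = queue.Queue()   # changes staged at the B site

def push_replicate(change: str, send) -> None:
    """Push: B transmits each change to C as soon as it is written."""
    send(change)

def pull_replicate(batch_size: int, send) -> None:
    """Pull: C requests up to batch_size pending changes when it is ready."""
    batch = []
    while len(batch) < batch_size and not change_log.empty():
        batch.append(change_log.get())
    if batch:
        send(batch)

# Push ties WAN usage to the write stream...
push_replicate("txn-0", send=print)        # txn-0

# ...while pull lets C drain staged changes in controlled batches.
for txn in ("txn-1", "txn-2", "txn-3"):
    change_log.put(txn)
pull_replicate(batch_size=2, send=print)   # ['txn-1', 'txn-2']
pull_replicate(batch_size=2, send=print)   # ['txn-3']
```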
Action item: Suppliers should consider adding what we call 'C-site services' to their portfolios: specifically, services that provide on-demand access to third-site recovery capabilities as an outsourced function. We expect C-site services to emerge over the next five years in a Software as a Service (SaaS) model, led by the likes of SunGard, IBM, and perhaps others such as EMC.
Retire roadblock products in 3-node DR projects
While decisions regarding 3-node DR architectures should be conceived and executed on the basis of application needs (e.g., core revenue-generating processes with high write-to-read ratios), the reality of today's application portfolios is deep interconnectedness of function. The process of choosing which application functions and data to support with a 3-node DR architecture usually reveals application resources that not only cannot be supported but need to be retired. These typically include software, network, and traditional storage technologies (e.g., legacy tape systems). Streamlining the implementation process, and reaping any savings made possible by exiting ancient maintenance contracts, requires a strong hand that is as focused and disciplined about retiring old products as others are about procuring new ones. The scope of this role must cut across storage, applications, and networks, or 3-node DR architecture projects will fail to deliver the expected results.
Action item: Before bringing new 3-node DR technologies into the shop, empower someone with the mandate and resources required to get rid of the roadblock technologies that offer limited other benefits.