Analyst: David Floyer
Moderator: Peter Burris
In certain core infrastructure industries (e.g., finance, insurance, telecommunications, health care and transport) and in the industries that support them, regulatory regimes dictate that organizations must reliably demonstrate the capability to fail over and fail back at distances of hundreds of miles. The reality is that demonstrating and testing this capability is becoming increasingly difficult and requires new approaches to disaster recovery.
For example, certain regulations demand that organizations keep active backups of data hundreds of miles away from the main source of data, so that a metropolitan disaster cannot take out all of the data. Under these circumstances, users need to put in place a mechanism that not only ensures data can be recovered, but also lets them verify, through reasonable testing approaches, that they have an appropriate architecture in place to mitigate disasters. In many respects this is becoming an increasing challenge.
What we see large organizations doing is setting up a system of data replication that uses three data centers. Two of the data centers (A and B) are synchronized, such that data written to the main center is written at the same time, in a synchronous manner, to the backup site. That way, if there is a problem with A, fail-over to B is relatively instantaneous; more importantly, fail-back, once A comes back online, can also be handled in a smooth and seamless fashion.
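To make the ordering concrete, here is a minimal Python sketch of the synchronous A/B write path described above. The Site class and its write() method are hypothetical stand-ins for logic that in practice lives inside the storage array or replication software; the point is simply that the application sees an acknowledgement only after both A and B hold the data, which is what makes fail-over and fail-back lossless.
 # Minimal sketch of the synchronous A/B write path (hypothetical helper
 # names; the real logic normally lives in the storage/replication layer).
 class Site:
     def __init__(self, name):
         self.name = name
         self.log = []                    # committed writes, in order
 
     def write(self, record):
         self.log.append(record)
         return True                      # acknowledgement to the caller
 
 def synchronous_write(primary, mirror, record):
     """Acknowledge the application only after BOTH sites hold the data."""
     return primary.write(record) and mirror.write(record)
 
 site_a = Site("A")                       # main data center
 site_b = Site("B")                       # synchronous partner, same metro area
 
 if synchronous_write(site_a, site_b, "txn-001"):
     # A and B are identical at every acknowledged write, so failing over
     # to B (and later failing back to A) loses no committed data.
     assert site_a.log == site_b.log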
However, because of the distance problem (synchronized storage nodes must be close enough in proximity to mitigate speed-of-light issues), sites A and B are usually within the same metropolitan area, which violates the rules governing placement of data centers. As a consequence, many organizations are introducing a third site (C) that lags the B site by a small interval, typically on the order of five to thirty minutes, while ensuring that the C site is far enough outside that metropolitan area (usually over 200 miles, to comply with regulations). This ensures that in the event of a rolling disaster that takes out site A and threatens site B, site C can have a perfect copy of the data within (say) five minutes. Because no data is lost, fail-over can be tested completely and then failed back, so testing of the organization's entire body of data is reliable and can be accomplished with confidence.
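The distance constraint comes down to simple propagation delay. As a rough, illustrative calculation (assuming light travels at roughly two-thirds of its vacuum speed in optical fiber, about 124,000 miles per second, and ignoring switching and protocol overhead), the sketch below shows why a synchronous partner can sit a few miles away but not a few hundred:
 # Back-of-envelope propagation delay added to every synchronous write.
 # Assumes ~124,000 miles/second in optical fiber (about two-thirds of c);
 # real links add switching and protocol overhead, so figures are illustrative.
 FIBER_SPEED_MILES_PER_SEC = 124_000
 
 def round_trip_ms(distance_miles):
     """Round-trip propagation delay in milliseconds."""
     return 2 * distance_miles / FIBER_SPEED_MILES_PER_SEC * 1000
 
 for miles in (10, 50, 200, 500):
     print(f"{miles:4d} miles -> ~{round_trip_ms(miles):.1f} ms per synchronous write")
 
 # Roughly: 10 miles adds ~0.2 ms (workable within a metro area), while
 # 200 miles adds ~3.2 ms to every write -- which is why the distant C site
 # is fed asynchronously, with a lag, rather than synchronously.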
Not all organizations need to consider this architecture today. Over the next few years, however, technologies for constructing a three-node data center solution relatively inexpensively will become more widely available and standard. Users should be evaluating vendors in part on their product road maps for introducing this capability.
The other key user issue requiring consideration is that the lag between the paired synchronous sites (A and B) and the C site will dictate the complexity of backup and restore. Longer lags can mean longer restore times, so tying the replication lag to the operational procedures for handling a potential disaster is an absolute necessity.
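As a rough illustration of that relationship, the sketch below converts a given B-to-C lag into the worst-case volume of writes that site C may not yet hold and that must be drained or reconciled during a restore. The write rate used is a hypothetical workload figure, not a benchmark:
 # Illustrative only: converting the B-to-C lag into restore workload.
 # The write rate is a hypothetical workload figure, not a benchmark.
 def writes_behind(lag_minutes, writes_per_second):
     """Worst-case number of committed writes that site C may not yet hold."""
     return lag_minutes * 60 * writes_per_second
 
 for lag in (5, 15, 30):
     backlog = writes_behind(lag, writes_per_second=500)
     print(f"{lag:2d}-minute lag -> up to {backlog:,} writes to catch up on restore")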
The action item for users is that core industries must move in this direction today. Non-core industries should watch how these core industries progress, not only from a technology perspective but also in terms of the practices they use to deliver high-quality, provable three-node disaster recovery capabilities.
The action item for vendors is to find ways to introduce these capabilities into their solutions as quickly as possible.
Who are the vendors who can do this? I know IBM has this capability and probably EMC...are there others? --ITGuru 15:14, 2 March 2007 (CST)