On April 10, 2012, the Wikibon community held a Peer Incite to discuss how to create a zero-data-loss environment for information technology. We were joined by Tim Hays, VP of IT at Animal Health International, a distributor of food and animal health products.
Many organizations view disaster recovery as an insurance policy. The executive team, the business units, the customers, and the partners may all wish that the organization had 100% uptime and 100% data protection for their IT systems, but no organization can successfully operate with zero tolerance for risk, and no organization has an unlimited budget for insurance.
If more organizations had larger disaster recovery budgets, more of them would already have adopted what has become best practice among the largest institutions in the financial services industry. These institutions maintain two data centers located within synchronous-replication distance of each other and a third data center or disaster recovery facility at an extended, asynchronous distance.
This approach, which is the logical conclusion from a mega-bank’s business impact analysis (BIA), enables rapid local recovery and zero data loss for many, but not all, disaster events, and somewhat longer recovery times with some data loss for major regional disasters. That said, even mega-banks cannot afford to protect all applications using this approach. Therefore, the BIA requires an application-by-application and process-by-process analysis that, due to the dynamic nature of both applications and processes, must be frequently updated.
Outside of highly regulated industries, such as financial services, and industries that have extremely high-frequency and/or high-value transactions and a substantial profit engine to protect, conducting and maintaining a detailed process-by-process BIA is unmanageable. In addition, a three-data-center approach is unaffordable and simply too much insurance for the multitude of mid-market organizations. Tim Hays and his management at Animal Health International concluded as much when they did their own back-of-the-napkin business impact analysis.
Animal Health’s primary production data center is located in a region isolated from hurricanes and tsunamis. Earthquakes, floods, and other natural disasters that can impact data center availability are relatively rare. That said, the organization did understand that fires, floods, earthquakes, and tornadoes, together with regional power and telecommunications outages, represent risk that can and should be managed, even if not eliminated. As a result, the organization settled on a more affordable, two-data-center asynchronous replication approach that would enable relatively rapid restoration of applications and provide some separation between the data centers, but that would also guarantee that some data would be lost in a disaster.
The selected approach was enabled by EMC’s RecoverPoint software, which provides periodic, application-consistent snapshots and replication of data between EMC CLARiiON and/or VNX storage systems. Animal Health essentially decided to live with a data-loss exposure window of approximately 30 minutes, recognizing that, because of the online and direct-order-entry nature of the company’s business, lost or corrupted data could not be reconstructed from other available documents or sources.
This maximum tolerable data loss was based upon the value of the electronic transactions (approximately $5 million per day), the probability of a data-loss-producing disaster, and an analysis of the additional cost of infrastructure and increased telecommunications bandwidth that would be required to close the data-loss exposure further. In short, it was as much insurance as the organization was willing to purchase.
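The trade-off described above can be sketched as a back-of-the-napkin calculation. The sketch below is illustrative only and is not Animal Health's actual analysis: it assumes transactions arrive evenly across a 24-hour day, and the function names and the 1% annual disaster probability are hypothetical values chosen for the example. Only the $5 million per day and 30-minute figures come from the discussion.

```python
MINUTES_PER_DAY = 24 * 60


def exposed_transaction_value(daily_value: float, window_minutes: float) -> float:
    """Value of transactions that fall inside the data-loss exposure window.

    Assumes (for illustration) that transaction value is spread evenly
    over a 24-hour day.
    """
    if daily_value < 0 or window_minutes < 0:
        raise ValueError("daily_value and window_minutes must be non-negative")
    return daily_value * window_minutes / MINUTES_PER_DAY


def expected_annual_loss(exposed_value: float, annual_probability: float) -> float:
    """Exposed value weighted by the yearly chance of a data-loss-producing disaster."""
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    return exposed_value * annual_probability


# Figures from the discussion: about $5 million per day, 30-minute window.
exposed = exposed_transaction_value(5_000_000, 30)
print(f"Transactions at risk in a 30-minute window: ${exposed:,.0f}")

# Hypothetical 1% annual disaster probability, purely for illustration.
print(f"Expected annual loss at 1% probability: ${expected_annual_loss(exposed, 0.01):,.2f}")
```

Under these assumptions, the figure to weigh against the cost of extra infrastructure and bandwidth is the expected annual loss, not the full daily transaction value. A real analysis would also account for peak-hour order volumes and for the cost of being unable to reconstruct the lost orders.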
Upon learning of an enhancement to the RecoverPoint offering, however, Tim Hays decided to augment his approach with the Phoenix System RP, available from EMC Select Partner Axxana. This solution can maintain all of the company’s at-risk data in a disaster-proof enterprise data recorder, much like a flight data recorder that preserves airplane data through disasters.
Like a flight data recorder, the Phoenix System RP protects data through extremes of heat, water exposure, fire, smoke, crushing, and piercing forces. Because the otherwise-exposed data is protected, transactions that have not yet replicated to the second site can be delivered in the event of a disaster, either physically, over a wired network, or wirelessly over a cellular network. This eliminates data-loss risk across a much broader range of disaster scenarios, and dramatically simplifies disaster recovery planning, disaster recovery processes, and disaster recovery testing.
Because of the high degree of integration between Axxana’s Phoenix System RP and RecoverPoint, implementation was simple. Once installed, administrators simply use the RecoverPoint user interface and select the Axxana option to enable zero data loss for the application.
Though it may at first seem surprising, Animal Health International chose to protect data not only for applications that a mega-bank-style BIA would have deemed critical, but also for applications that a BIA might have determined to be less critical, such as application development and test. As Tim discussed, many organizations fail to consider the business impact of having all of their developers unable to work.
With all applications equally and completely protected, Animal Health International also avoids the time and expense of having to frequently revisit the business impact analysis and avoids the risk associated with potential misclassification of applications that support a business process. As Tim described, business processes are complex. For example, orders at Animal Health International come in through email, fax servers, direct wired-terminal input, work-at-home employees, customer terminals, and mobile route vans. The chance that some application that supports the order-entry process may be overlooked is very high. The protect-everything-equally approach is simpler, less risky, and, because of the technology, affordable as insurance.
Action Item: Regardless of the current state of an organization’s disaster recovery plan, this approach warrants consideration. Organizations that have “settled” for asynchronous replication due to bandwidth and infrastructure costs now have an affordable alternative. Organizations that have “invested” in three-data-center approaches may be able to reduce current expense and eliminate the risk of data loss in a region-wide disaster that impacts both synchronous-distance data centers. Organizations should consider the reduction in network costs, the value of simplicity, and the often-overlooked risk of misclassification when evaluating this approach.
Footnotes: Disclosure: Walden Technology Partners, Inc., the author's firm, provides retainer and project-based consulting services to technology companies, including Axxana. A list of current Walden Technology Partners, Inc. consulting services clients can be found here.