Storage Peer Incite: Notes from Wikibon’s April 10, 2012 Research Meeting
Zero data loss has traditionally been the province of very high-performance, high-value computing environments such as currency traders, where the loss of even a few seconds of data could cost millions. Those companies use an exotic and very expensive solution involving three complete data centers: two within synchronous distance of each other, say Manhattan and Newark, and a third far enough away to be unaffected by a regional disaster, for instance Dublin. That architecture remains the gold standard both for minimal data loss and for minimal recovery times.
Today, however, virtually every company that conducts a significant amount of business over the Internet needs to consider a zero-data-loss DR solution to avoid losing valuable transaction records in a disaster such as a fire. This includes many SMBs and other companies that cannot afford the exotic three-data-center solution, ranging from retailers to business-to-business manufacturers and service providers.
That is exactly the problem that Tim Hays, VP of IT at Animal Health International, faced. At the April 10 Peer Incite meeting, he discussed the creative solution he found, the Axxana "black box" backup appliance, which he described as a "no brainer".
The good news is that Axxana solves the data loss problem, although at the expense of a longer data recovery time. The bad news is that at this time Axxana only works with EMC technology, so companies that use other storage vendors need to find other solutions, at least until Axxana expands its technology to work with other vendor products.
The message from the Peer Incite is that if your organization does a significant amount of its business over the Internet and still depends on traditional tape-based backup as its only DR solution, you need to reevaluate your recovery plan. And three-data-center architectures are no longer the only solutions available. Bert Latamore, Editor
On April 10, 2012, the Wikibon community held a Peer Incite to discuss how to create a zero-data-loss environment for information technology. We were joined by Tim Hays, VP of IT at Animal Health International, a distributor of food and animal health products.
Many organizations view disaster recovery as an insurance policy. The executive team, the business units, the customers, and the partners may all wish that the organization had 100% uptime and 100% data protection for their IT systems, but no organization can successfully operate with zero tolerance for risk, and no organization has an unlimited budget for insurance.
If more organizations had a larger disaster recovery budget, then more would have already adopted what has become best-practice among the largest institutions in the financial services industry. These institutions maintain two data centers located within synchronous-replication distances and a third data center or disaster recovery facility at an extended asynchronous distance.
This approach, which is the logical conclusion from a mega-bank’s business impact analysis (BIA), enables rapid local recovery and zero data loss for many, but not all, disaster events, and somewhat longer recovery times with some data loss for major regional disasters. That said, even mega-banks cannot afford to protect all applications using this approach. Therefore, the BIA requires an application-by-application and process-by-process analysis that, due to the dynamic nature of both applications and processes, must be frequently updated.
Outside of highly-regulated industries, such as financial services, and industries that have extremely high-frequency and/or high-value transactions and a substantial profit engine to protect, conducting and maintaining a detailed process-by-process BIA is unmanageable. In addition, a three-data-center approach is unaffordable and simply too much insurance for the multitude of mid-market organizations. Tim Hays and his management at Animal Health International concluded as much when they did their own back-of-the-napkin business impact analysis.
Animal Health’s primary production data center is located in a region isolated from hurricanes and tsunamis. Earthquakes, floods, and other natural disasters that can impact data center availability are relatively rare. That said, the organization did understand that fires, floods, earthquakes, and tornadoes, together with the concern over regional power and telecommunications outages, represent risk that can and should be managed, even if not eliminated. As a result, the organization settled on a more-affordable, two-data center asynchronous replication approach that would enable relatively rapid restoration of applications, provide some separation between the data centers, but guarantee that some data would be lost in a disaster.
The selected approach was enabled by EMC’s RecoverPoint software that provides periodic, application-consistent snapshots and replication of data between EMC CLARiiON and/or VNX storage systems. Animal Health essentially decided to live with a data-loss exposure window of approximately 30 minutes, recognizing that, because of the online and direct-order-entry nature of the company’s business, lost or corrupted data could not be reconstructed from other available documents or sources.
This maximum tolerable data loss was based upon the value of the electronic transactions (approximately $5 million per day), the probability of a data-loss-producing disaster, and an analysis of the additional cost of infrastructure and increased telecommunications bandwidth that would be required to close the data-loss exposure further. In short, it was as much insurance as the organization was willing to purchase.
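To put the 30-minute exposure window in perspective, the arithmetic can be sketched as follows. This is a rough illustration under an assumption the source does not state: that the roughly $5 million in daily electronic transactions arrives evenly over a chosen number of active hours. The function name `exposure_value` is hypothetical.

```python
def exposure_value(daily_value: float, window_minutes: float,
                   active_hours: float = 24.0) -> float:
    """Value of transactions created during a data-loss window.

    Assumes (hypothetically) that transactions arrive uniformly over
    `active_hours` each day; real order flow is burstier than this.
    """
    value_per_minute = daily_value / (active_hours * 60.0)
    return value_per_minute * window_minutes


if __name__ == "__main__":
    # $5M/day spread evenly over 24 hours.
    print(f"{exposure_value(5_000_000, 30):,.0f}")
    # The same daily volume concentrated in an 8-hour business day.
    print(f"{exposure_value(5_000_000, 30, active_hours=8):,.0f}")
```

Under the 24-hour assumption about $104,000 of transactions falls inside each 30-minute window; if the same volume is concentrated in an 8-hour business day, the figure triples to about $312,500, so the more concentrated the order flow, the larger the exposure.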
Upon learning of an enhancement to the RecoverPoint offering, Tim Hays, however, decided to augment his approach with the Phoenix System RP, available from EMC Select Partner, Axxana. This solution can maintain all of the company’s at-risk data in a disaster-proof enterprise data recorder, much like a flight data recorder that maintains airplane data through disasters.
The Phoenix System RP provides similar levels of data protection through extremes of heat, water exposure, fire, smoke, crushing, and piercing forces. By protecting the otherwise-exposed data, transactions that have not yet replicated to the second site can be delivered either physically, over a wired network, or wirelessly over a cellular network, in the event of a disaster. This eliminates data-loss risk across a much broader range of disaster scenarios, and dramatically simplifies disaster recovery planning, disaster recovery processes, and disaster recovery testing.
Because of the high degree of integration between Axxana’s Phoenix System RP and RecoverPoint, implementation was simple. Once installed, administrators simply use the RecoverPoint user interface and select the Axxana option to enable zero data loss for the application.
Though it may at first seem surprising, Animal Health International chose to protect data not only for applications that a mega-bank-style BIA would have deemed critical, but also for applications that a BIA might have determined to be less critical, such as application development and test. As Tim discussed, many organizations fail to consider the business impact of having all of your developers unable to work.
With all applications equally and completely protected, Animal Health International also avoids the time and expense of having to frequently revisit the business impact analysis and avoids the risk associated with potential misclassification of applications that support a business process. As Tim described, business processes are complex. For example, orders at Animal Health International come in through email, fax servers, direct wired-terminal input, from work-at-home employees, from customer terminals, and from mobile route vans. The chance that some application that supports the order-entry process may be overlooked is very high. The protect-everything-equally approach is simpler, less risky and, because of the technology, affordable insurance.
Action item: Regardless of the current state of an organization’s disaster recovery plan, this approach warrants consideration. Organizations that have “settled” for asynchronous replication due to bandwidth and infrastructure costs now have an affordable alternative. Organizations that have “invested” in three-data-center approaches may be able to reduce current expense and eliminate the risk of data loss in a region-wide disaster that impacts both synchronous-distance data centers. Organizations should consider the reduction in network costs, the value of simplicity, and the often-overlooked risk of misclassification when evaluating this approach.
Footnotes: Disclosure: Walden Technology Partners, Inc., the author's firm, provides retainer and project-based consulting services to technology companies, including Axxana. A list of current Walden Technology Partners, Inc. consulting services clients can be found here.
This week, I participated in the Wikibon Peer Incite entitled "Creating a Zero Data Loss Environment" during which Tim Hays, VP of Information Technology at Animal Health International, discussed his company’s implementation of Axxana – essentially a data center "black box" technology – in order to create a zero-data-loss environment. The discussion left clear takeaways that make such endeavors interesting and, in the right circumstances, almost a no brainer.
Axxana has created a product that leverages EMC’s RecoverPoint technology to create what amounts to a zero-data-loss situation. With the Axxana and RecoverPoint combination, zero data loss can be achieved even over long distances.
While the Axxana solution currently requires RecoverPoint in order to work, let's consider some of the opportunities that such a scenario might provide. First of all, when it comes to disaster recovery, the sheer cost of implementing the perfect scenario is often prohibitive. When one considers disaster recovery as nothing more than an insurance policy, organizations have to make tough choices about how much insurance they'd like to buy. When we spoke with Mr. Hays this week, he said that, having compared a number of complete disaster recovery options, he considered the Axxana solution a "no brainer" next to the cost of implementing a three-data-center solution. In fact, the massive communications costs alone would have been far more expensive than what Animal Health International ended up paying for the Axxana solution.
When CIOs read papers regarding these kinds of initiatives, the general guidance is to perform a number of analyses, including a business impact analysis, which leads to decisions around appropriate recovery point objectives and recovery time objectives. Mr. Hays was extremely fortunate. The zero data loss solution is about as good as it gets. So, if the cost differential is so insane that it becomes a no brainer to go with the best insurance possible, do it. This was the situation that Mr. Hays found himself in.
Personally, I love those kinds of decisions! While it's great to follow the "best practices" and expend a ton of effort, when you don't have to bother, it's wonderful.
Further, when it comes time to determine how much insurance you want in your disaster recovery plan, the increasing complexity of modern data systems can make this analysis incredibly difficult. It is becoming ever harder to completely decouple systems from one another. Applications, processes and services are so intertwined that attempting to make granular decisions about which applications should be considered mission critical can be difficult, if not impossible. This application tangling also adds complexity on the recovery side of the equation, particularly if you attempt to decouple applications to keep DR costs down.
What if you could just forget having to think about what should and should not be considered mission critical? What if you could just say “protect it all” and be done with it? What if you didn’t have to worry about tangled apps? Because of the cost-effectiveness of the Axxana solution, Mr. Hays was able to do exactly this. He’s protecting both his production and development environments with the solution and is doing so in a complete way.
There remain a great number of organizations out there that have either poor or no disaster recovery plans. The reasons vary, but two big factors are cost and complexity, the two issues that Mr. Hays has been able to eliminate as factors from his decision.
There is a common theme here that must be observed: Simplicity. Mr. Hays was able to eschew what is often considered the "best practice" with regard to disaster recovery and, rather than spend weeks developing complex business justification metrics and making decisions about what could and could not be "insured", use a simple back-of-the-napkin cost/benefit approach and say "protect it all." Again, this is not due to laziness or lack of judgment; in fact, it is due to the no-brainer aspect of the zero-data-loss solution he chose.
Of course, not everyone will have this luxury. If you’re not an EMC customer or don’t want to be locked into EMC for whatever reason, today’s Axxana solution is not for you. That said, there are other solutions out there that may prove to be just as obvious as the one chosen by Mr. Hays. By moving forward with this simple, elegant solution, Mr. Hays’ team is able to spend much more time on the truly important side of the disaster recovery equation, which is recovery. Further, the time that might have been spent planning and constantly testing a traditional complicated disaster recovery solution can be spent on business-facing value-add initiatives instead.
Action item: That was a major takeaway for me this week. Simplicity. It's so easy to overengineer and overthink solutions and create structures that are barely sustainable. When possible, go for the simple; go for the elegant. Go for solutions that make you say, "Well, duh!" and spend more time making sure the business remains operational and gets better.
Tim Hays gave a glowing report on the Axxana zero data loss solution during the April 10 Peer Incite. And Wikibon's experts agree. Wikibon CTO David Floyer, who gave Axxana an annual Wikibon CTO Award for its innovative technology, has said publicly that the only thing that surprises him is the market's slow adoption rate of Axxana.
Axxana is so comparatively inexpensive that Hays said the savings in transmission costs alone can pay for it. This is because a two-data-center synchronous connection has to be sized for the highest possible expected data load. That priced Animal Health International out of the synchronous dual data center market. Axxana allows Animal Health to use a data connection sized for the average data volume by holding excess data in the Axxana box until it can be transmitted.
And Axxana is so simple to add to Animal Health's EMC installation that Hays said it literally was a simple check box in the EMC RecoverPoint solution Animal Health had installed to provide its basic backup and recovery.
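The peak-versus-average sizing argument can be illustrated with a simple queue model. The sketch below assumes a hypothetical hourly write profile and treats the replication link as a fixed per-interval capacity. It compares the peak write rate (what a synchronous link must carry) with the backlog that builds up when the link is sized near the average (the data a protected black box would have to hold). It is not a model of how RecoverPoint actually schedules replication; `max_backlog` and all numbers are illustrative.

```python
from typing import Sequence


def max_backlog(writes_per_interval: Sequence[float],
                link_capacity_per_interval: float) -> float:
    """Largest amount of not-yet-replicated data over the profile.

    Each interval adds newly written data to the backlog and drains up to
    `link_capacity_per_interval`; the backlog can never go negative.
    """
    backlog = 0.0
    peak_backlog = 0.0
    for written in writes_per_interval:
        backlog = max(0.0, backlog + written - link_capacity_per_interval)
        peak_backlog = max(peak_backlog, backlog)
    return peak_backlog


if __name__ == "__main__":
    # Hypothetical hourly write volumes in GB, with a two-hour burst.
    hourly_gb = [2, 2, 8, 8, 2, 2, 2, 2]
    average = sum(hourly_gb) / len(hourly_gb)  # average write rate
    peak = max(hourly_gb)                      # what a synchronous link must carry
    link = 4.0                                 # asynchronous link just above average
    print(f"average {average} GB/h, peak {peak} GB/h")
    print(f"buffer needed with a {link} GB/h link: {max_backlog(hourly_gb, link)} GB")
```

In this made-up profile the asynchronous link is half the size a synchronous connection would need (4 GB/h versus 8 GB/h), at the cost of up to 8 GB of unreplicated data sitting at the primary site during the burst, which is the data a disaster-proof recorder is meant to preserve.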
So what is the bad news? Axxana only works with RecoverPoint. So while it provides what Hays called a “no brainer” solution for EMC customers, it doesn't help the rest of the companies out there, at least for the present. Wikibon expects Axxana to expand its technology to work with other products eventually. But meanwhile perfectly financially healthy SMBs will continue to go into bankruptcy after a fire destroys their headquarters and their records, both digital and paper. And other companies will be damaged financially by the loss of a few hours of transactional data from their online stores after something as commonplace as a hard drive failure or data corruption requiring a reload of the last backup of the database – typically from the night before on tape.
Action item: The message for EMC customers is to consider Axxana. Even for those that do not need zero data loss, it solves so many DR problems and saves so much money just by eliminating the need for DR system tests, for instance, that it is worth considering, even for small companies. And customers of competing data storage systems need to push their vendors either to work with Axxana or provide a similar solution. Nightly backups are no longer enough in the age of Internet-based, self-service sales.
In choosing a disaster recovery solution, customers have long had to balance how much they are willing to spend to avoid data loss. There is no shortage of solutions from the vendor ecosystem for DR. However, there is little innovation in this space; most solutions use straightforward “brute force” methods of creating multiple sites to store data and utilizing deduplication and WAN optimization to reduce the cost of bandwidth between sites.
To create a zero data loss environment over long (asynchronous) distance, the options available are 3-node DR or Axxana's "black box" technology. Since getting data offsite inherently involves risk and cost, the innovative idea of Axxana's solution is that it guarantees the protection of data locally, ensuring recoverability even if some data doesn't make it to the DR site. The inherent utility of a three-site DR solution is low, and it also requires high-cost bandwidth between all of the locations, which only the largest of organizations can cost justify. Going asynchronous can enable a 10X reduction in network expense. Software and hardware price adjustments only address a portion of the affordability issue.
As data is not becoming any less valuable, the vendor community needs to find creative solutions that meet the zero-data loss requirement and the price and utility needs of more companies. This can include partnering with Axxana to deliver platform support beyond EMC. Using an object-based solution as discussed in the Cloud Archiving Forever without Losing a Bit Peer Incite is another alternative.
Action item: The traditional disaster recovery solution market is ripe for change. Vendors will need a zero-data-loss offering to meet the competitive threat from Axxana/EMC, since data loss will become increasingly unacceptable. Vendors should look at Axxana as a creative example of how to find new ways to solve old challenges.
In April 2012, Wikibon held a Peer Incite on zero data loss using EMC RecoverPoint together with Axxana's "black box" technology, which can retain and allow recovery of transient data through disasters such as fires, floods, tornadoes, and earthquakes. Tim Hays, VP of IT at Animal Health International, a distributor of food and animal health products, talked about his experience implementing the EMC/Axxana solution to provide a disaster recovery solution that reflected the company's recovery priorities at a cost the business could afford.
In 2009, Wikibon published "The economic value of Axxana's zero data loss solution". This professional alert is an update to that piece, using updated assumptions and pricing.
The key conclusion that the case study of Animal Health International illustrates is that the EMC/Axxana solution makes zero data loss techniques cost effective for mid-sized companies who are outside the financial or banking industry.
In addition, it shows that in today's IT environment application systems are increasingly intertwined; a zero data loss solution for all active data leads to much simpler and more efficient business recovery processes, and gives better business protection than solutions that focus on zero data loss and faster recovery for only a subset of applications.
The economic model Wikibon developed uses off-site tape backup as the baseline to compare the costs and value of alternative disaster recovery solutions, including the EMC RecoverPoint and Axxana two-node zero data loss at any distance. Tape recovery was used as a baseline because it represents a lowest common denominator for comparison. Virtually all organizations can relate to, conceptualize and roughly quantify the costs and business process impacts associated with a tape-based recovery solution. As such this technique allows Wikibon to make mathematically consistent and defensible comparisons across any use case scenario.
Investment in disaster recovery (DR) is essentially insurance against unlikely events. To evaluate insurance, it is useful to look at the expected loss[1] of different alternatives and compare the costs associated with these alternatives. The expected loss comes from:
- Loss of data (e.g., orders, transfers etc. that are in process); the amount of data that will be lost if a disaster occurs and the time taken to restore the business service.
- Loss of IT service to personnel, customers, and partners.
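A minimal sketch of the expected-loss comparison, assuming the simplest possible model: expected annual loss equals the annual probability of a disaster times the cost of lost data (RPO hours times the value of transactions per hour) plus the cost of the outage (RTO hours times an hourly outage cost). The tape figures (18-hour RPO, 96-hour RTO) are the ones used in this report's case example; the probability, the hourly values, and the RTO assumed for the zero-data-loss scenario are hypothetical placeholders, not Wikibon model inputs. The benefit of each alternative is measured against tape, mirroring the baseline approach described below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DRScenario:
    name: str
    rpo_hours: float  # hours of data lost, on average, in a disaster
    rto_hours: float  # hours until full service is restored


def expected_annual_loss(s: DRScenario, p_disaster_per_year: float,
                         data_value_per_hour: float,
                         outage_cost_per_hour: float) -> float:
    """Probability-weighted cost of one disaster under scenario `s`."""
    data_loss = s.rpo_hours * data_value_per_hour
    outage_loss = s.rto_hours * outage_cost_per_hour
    return p_disaster_per_year * (data_loss + outage_loss)


if __name__ == "__main__":
    tape = DRScenario("offsite tape", rpo_hours=18, rto_hours=96)
    zero_loss = DRScenario("zero data loss (hypothetical RTO)", rpo_hours=0, rto_hours=12)
    # Hypothetical inputs: 2% annual disaster probability, $200k of
    # transactions per hour, $50k per hour of outage impact.
    baseline = expected_annual_loss(tape, 0.02, 200_000, 50_000)
    for s in (tape, zero_loss):
        el = expected_annual_loss(s, 0.02, 200_000, 50_000)
        print(f"{s.name}: expected loss {el:,.0f}, benefit vs tape {baseline - el:,.0f}")
```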
The economic value model we developed captures several information sources necessary to evaluate the costs and benefits associated with virtually any disaster recovery (DR) solution. They include information necessary to determine both cost and expected loss in real value terms, as well as drive key assumptions relative to various DR scenarios which are described below.
Base Information Variables driving the Wikibon Model
The basic variables include:
- The revenue contribution of the applications supported by the DR infrastructure;
- The storage capacity being protected;
- The distance between the primary and backup sites (as the line lurches);
- I/O rate for applications being protected;
- The business impact of an outage;
- The probability of an outage;
- Line costs for the specific locations;
- Cost of primary storage;
- Data necessary to calculate NPV and other financials.
DR Alternatives Assessed by Model
The model addresses a spectrum of five (5) main scenarios:
- Base Case of Offsite Tape Only:
- This is a well-understood, low-cost and time-honored method of managing disaster recovery. A coherent set of data is written to tape and transported off-site. The amount of data that would be lost in a disaster (Recovery Point Objective, RPO) is determined by how often the system is backed up and how quickly it is transported off-site. If a system is backed up daily, and it takes six hours to move the data off-site, then the average amount of data lost will be that created in an 18-hour production window. There are new technologies for improving the time to transfer the data (snapshots, de-duplication, and transmission over the wire) but the fundamentals have not changed significantly.
- The Recovery Time Objective (RTO) to bring up the system again on another system, reconcile the data lost and bring it back into full production is usually measured in days.
- Two Data Center Synchronous:
- Creating a second copy of the data in a second site at the same time as the original data is created means that if the primary site is lost, the second site can recover without any data being lost. This RPO-zero solution is well understood and used extensively in the financial sector. Usually the second data center is fully equipped with servers, and because no data is lost, recovery time (RTO) is usually quick and can be a matter of hours. However, the distance between the two data centers is limited to less than 40 miles for most applications (see Asynchronous below). There is a significant chance that both sites will be taken out by a rolling disaster (disaster being defined in a broad sense that could, for example, include local unrest, union action, etc.), in which case recovery would have to be made from a normal tape backup. Financial watchdogs such as the SEC strongly recommend distances of 200 miles or more between data centers to eliminate the risk of a rolling disaster, and this makes Two Data Center Synchronous alone not viable for large and/or non-local organizations. The Two Data Center Asynchronous or Three Data Center solutions (see below) are the normal alternatives.
- Two Data Center Asynchronous:
- The ideal location for a second data center is usually hundreds of miles away from the primary site. As discussed above, the SEC and other regulatory bodies strongly recommend that a recovery solution include a data center at least 200 miles from the primary site. However, for normal applications it is not possible to keep the data exactly consistent in both locations (if you wait for an acknowledgment that the data is safe at the second site, the delays in the transmission, even at near the speed of light, mean that wait time for I/O becomes unacceptably high, and system throughput slows to a crawl). An asynchronous DR solution keeps a small buffer of information at the primary end and ensures that a coherent set of data is transmitted to the other end. The RPO is much better than a tape backup solution, but the fact that some data is lost means that the RTO is longer than synchronous solutions because the databases need to be reconciled with other business records.
- Three Data Center DR solutions:
- Three data center solutions are a hybrid between synchronous and asynchronous solutions. A two-data-center synchronous solution is set up between the primary (“A”) site and the “B” site less than 20 miles away, and a second, asynchronous connection is set up with the remote site (“C” site). Either the B site or the primary A site is connected to the C site, or both are. This cascaded or multi-hop approach ensures that most of the time failovers can occur to the B site without data loss, and in the case that both the primary A and B sites are taken out, the C site can recover with less data loss, much more quickly than a tape recovery approach. However, the transmission line, data center infrastructure and storage costs of such solutions are very high, and this solution, therefore, is only used by a relatively small number of organizations (mostly financial).
- Two Data Center EMC RecoverPoint and Axxana Solution:
- The EMC RecoverPoint and Axxana solution is logically the same as a three data center solution. The difference is that a synchronous copy of the data that has not yet been sent to the remote site is held at the primary site. It is protected from a disaster not by distance, but by Axxana's “black-box” technology, which provides physical protection from fire, water or earthquake. In the event of a disaster, the data in the black box can be sent over the Internet or a cellular network to the remote location. This enables a zero-data-loss solution with two data centers at extended distances.
- RecoverPoint is an appliance with 3 types of splitters. These are host based, fabric based, and array based. RecoverPoint works with EMC and non-EMC array products except for the array-based splitter option.
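Two pieces of arithmetic behind these scenarios can be sketched directly. The first reproduces the tape example: averaged over a uniformly random disaster time, the newest off-site copy is half a backup interval plus the transport time old (12 + 6 = 18 hours for a daily backup with six hours of transport). The second estimates why synchronous replication is distance-limited, using the common approximation that light travels through fiber at roughly 200 km per millisecond. It counts only propagation delay for one round trip per write and ignores switching, protocol overhead, and any extra round trips, so real penalties are higher. Function names are illustrative.

```python
KM_PER_MILE = 1.609344
FIBER_KM_PER_MS = 200.0  # approximation: light in fiber covers ~200 km per ms


def average_tape_rpo_hours(backup_interval_hours: float,
                           transport_hours: float) -> float:
    """Average hours of data lost if a disaster strikes at a uniformly random time.

    Averaged over the backup cycle, the newest copy already off-site is half
    a backup interval plus the transport time old.
    """
    return backup_interval_hours / 2.0 + transport_hours


def sync_write_penalty_ms(distance_miles: float) -> float:
    """Propagation delay added to each synchronous write (one round trip)."""
    round_trip_km = 2.0 * distance_miles * KM_PER_MILE
    return round_trip_km / FIBER_KM_PER_MS


if __name__ == "__main__":
    print(f"daily tape, 6h transport: {average_tape_rpo_hours(24, 6)} h average RPO")
    for miles in (20, 40, 80, 200):
        print(f"{miles:>3} miles: +{sync_write_penalty_ms(miles):.2f} ms per write")
```

At 200 miles the propagation delay alone is a little over 3 ms per write, and a transaction-processing workload issues many dependent writes, which is consistent with the report's treatment of synchronous replication as a short-distance option.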
Our conclusions focus on two main areas:
- What is the impact of EMC RecoverPoint and Axxana on benefits?
- What does the model say about the impact of EMC RecoverPoint and Axxana on costs?
At the highest level, the EMC RecoverPoint and Axxana technology brings the probability of losing data very close to zero at asynchronous distances. It provides the same level of business protection from data loss as a three data center solution and a higher level of protection than either synchronous or asynchronous solutions. Because no data is lost, RTO time should also be better than asynchronous solutions. RTO will be slower than synchronous solutions if the second site is unaffected but much faster in the case that both synchronous sites are affected by a disaster. Conceptually, compared to alternatives, the EMC RecoverPoint and Axxana approach simplifies implementation and testing of near zero data loss solutions.
The main impacts of an EMC RecoverPoint and Axxana solution on cost are:
- It reduces line costs by decreasing the peak threshold required for a desired service level;
- It reduces the cost of storage because less redundancy is needed to meet the same recovery objectives;
- It simplifies the set up and operating environment;
- It allows much easier testing of DR function.
In theory, the EMC RecoverPoint and Axxana approach will allow organizations to eliminate or avoid building an entire data center (e.g. the B site) in a three data center solution. However the solution must be proven in the market before this strategy is widely adopted.
The EMC RecoverPoint and Axxana solution is not appropriate for very small systems (e.g. below about 20TB) where the cost of the EMC RecoverPoint and Axxana solution is higher than a simple replicated solution and the cost of EMC RecoverPoint and Axxana would, therefore, exceed the benefits.
In order to illustrate the economics of EMC RecoverPoint and Axxana's solution, we have run the following case example through the model. The customer profile (similar to, but not the same as, Animal Health International) is:
- A mid-sized organization with revenues of $500 million;
- Core business applications and Microsoft support applications;
- Two locations about 80 miles apart;
- A tape-based backup and recovery system
The impact of EMC RecoverPoint and Axxana's solution on the benefit side is notable and essentially identical to non-tape alternatives. Specifically, Figure 1 shows these benefits relative to alternative DR approaches. The primary benefit calculated is the reduction in expected loss (i.e. the lower probability of losing data) as a result of putting in place a disk-based recovery solution (synchronous, asynchronous or three data center). In each scenario, the benefits of the target DR solution are based on a comparison to tape-based recovery. As such, relative to tape-based recovery, all solutions show substantial benefits from a reduction in expected loss.
This factor is due primarily to the following points:
- The RTO of all alternative scenarios is dramatically improved over tape's 96 hours;
- The RPO in all alternative cases is dramatically improved relative to tape's 18 hours of data loss;
- The expected loss of the asynchronous solution is greater than the solutions with a synchronous component;
- The simplification of business recovery processes from zero data loss did not accrue to the asynchronous solution.
Overall, all disk-based solutions were significantly better than the current tape-based solution, and EMC RecoverPoint and Axxana demonstrates benefits that are equal to or greater than alternatives.
As seen in Figure 2, EMC RecoverPoint and Axxana's solution has a lower cost-of-ownership than alternative disk-based DR solutions. Our analysis for this specific example shows the following:
- Costs for EMC RecoverPoint and Axxana's solution are approximately $0.8M lower than those required to run asynchronous or synchronous data protection;
- Costs for EMC RecoverPoint and Axxana are nearly $8M lower than those required to run a 3-node data center solution.
In our assessment, EMC RecoverPoint and Axxana will have the lowest cost of staff, because the solution is simpler to install, test, and manage. For example, an EMC RecoverPoint and Axxana approach reduces the amount of equipment needed to be managed. In a two data center solution, exact copies of servers are required in two sites and in a three data center approach, three sets of servers are needed. In addition, the cost of lines is lower for EMC RecoverPoint and Axxana over asynchronous distances because of a reduced peak bandwidth requirement. Synchronous two data center solutions require dark fibre over shorter distances, increasing costs.
EMC RecoverPoint software was assumed for the EMC/Axxana solution and EMC SRDF or HDS TrueCopy was assumed for alternatives, meaning fewer copies of data were required for EMC RecoverPoint and Axxana. The footnotes provide additional detail about the inputs and assumptions used for the model.
Summary and Conclusions from Model
The question the model attempts to address is: Relative to advanced disk-based DR solutions, how does EMC RecoverPoint and Axxana fare? From the case study above and other analysis using the model, the following key points are highlighted:
- The EMC RecoverPoint and Axxana approach decreases the expected loss relative to asynchronous solutions;
- The EMC RecoverPoint and Axxana solution provides risk reduction substantially similar to both synchronous and 3-node data center approaches at a much lower cost;
- As a result, despite the higher costs of disk-based DR solutions, the ROI of all these solutions is evident in environments with high data value;
- The incremental CAPEX and OPEX of the EMC RecoverPoint and Axxana solution are much lower than those of the alternatives.
Figure 3 shows the 3-year net present value of the EMC RecoverPoint and Axxana solution as about $9.1M higher than that of an asynchronous solution, and over $20M higher than that of a 3-data-center solution.
The bottom line is that the EMC RecoverPoint and Axxana approach appears to substantially cut the cost of achieving near-zero data loss, and it can do so at asynchronous distances. This dramatically decreases infrastructure costs relative to 3-node data centers and reduces the proximity risk common in synchronous operations.
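A 3-year net present value discounts each year's net cash flow (benefits such as reduced expected loss, minus costs) back to today and sums the results. The minimal sketch below shows the arithmetic used to compare two options. The cash flows and the 10% discount rate are purely hypothetical and do not reproduce the inputs behind Figure 3.

```python
from typing import Sequence


def npv(rate: float, cash_flows: Sequence[float]) -> float:
    """Net present value: cash_flows[0] occurs today (undiscounted) and
    cash_flows[t] occurs at the end of year t."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))


# Hypothetical figures in $M: year-0 investment, then three years of net benefit.
option_a = [-2.0, 1.5, 1.5, 1.5]
option_b = [-5.0, 1.2, 1.2, 1.2]

rate = 0.10  # assumed discount rate
difference = npv(rate, option_a) - npv(rate, option_b)
print(f"NPV advantage of option A: ${difference:.2f}M")
```

A lower up-front investment counts in full because it falls in year 0. Later-year benefits are discounted, so differences in initial CAPEX weigh heavily in a 3-year comparison.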
Wikibon will be happy to run the model for Wikibon clients.
Wikibon analysts have extensive experience in assessing the economic value of disaster recovery solutions. Our experts have studied this issue for more than a decade and have constructed dozens of models to support large financial institutions and a variety of cross-industry organizations. We have done so in both mainframe and non-mainframe environments and studied virtually every vendor's solution in this space.
Axxana is a startup and must prove to us and the world that it can execute on its vision of providing high-quality disaster recovery solutions at substantially reduced operational costs. Axxana faces several hurdles in this regard, including product stability, channel uptake, the ability to evolve its product, and customers' willingness to fit the solution into their business processes or potentially alter processes to fit the solution.
Nonetheless, on balance we are impressed with the Axxana management team. We feel they are capable of securing the continued funding necessary to execute and have the wherewithal to deliver on the company's vision.
Action item: For most businesses, the potential loss of brand image and customer/partner trust from lost data significantly outweighs the benefit of slightly faster times to resume business. Technology has very significantly reduced the cost of zero data loss solutions, and these solutions can be justified by simpler and more efficient business processes. CEOs and CIOs should focus on creating a long-term strategy for providing a zero data loss solution for all active data.
Footnotes: 1 Expected loss is calculated as the sum, over all disastrous events that can occur, of each event's financial impact multiplied by the probability of that event occurring within a given time interval. The model uses the formula below to calculate the expected annual loss from n events that lead to data loss and an interruption of IT services, where each event has a financial impact ($Impact) and a probability (p) that it will occur within a year:
\[ \text{Expected annual loss} = \sum_{i=1}^{n} p_i \times \$\text{Impact}_i \]
Insurance companies are a good source of information about the probability of different events.
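The footnote's calculation can be sketched in a few lines. Each event contributes its annual probability p multiplied by its financial impact, and the contributions are summed over all n events. The event names, probabilities, and impacts below are hypothetical and are not taken from the Wikibon case study.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DisasterEvent:
    name: str
    annual_probability: float  # p: chance the event occurs within one year
    impact_dollars: float      # $Impact: financial impact if it does occur


def expected_annual_loss(events: Iterable[DisasterEvent]) -> float:
    """Sum of p * $Impact over all events."""
    return sum(e.annual_probability * e.impact_dollars for e in events)


# Hypothetical inputs for illustration only.
events = [
    DisasterEvent("site fire", 0.01, 5_000_000),
    DisasterEvent("regional flood", 0.002, 20_000_000),
    DisasterEvent("storage array failure", 0.05, 400_000),
]
print(f"Expected annual loss: ${expected_annual_loss(events):,.0f}")
```

A DR option enters the model by changing an event's $Impact. For example, less lost data after a fire reduces that event's impact and therefore the total.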
For years, certain industries, such as finance, insurance, telco, health care, transportation, and their supporting industries, have been required for regulatory and compliance reasons to demonstrate that they can fail over and fail back, often at asynchronous distances. Proving this capability has become increasingly difficult, because failover to the backup site cannot be tested thoroughly without exposing production systems during the test. Specifically, if something goes wrong in the testing process, the production systems could be damaged or lost permanently.
As a result, many companies test DR with a historical backup copy of the production system rather than with the production system itself. While this approach approximates a disaster scenario, many feel it is an inadequate simulation of a real disaster.
The response for many organizations has been to install three-site data centers, where two locations are within synchronous distance and a third data center is placed many hundreds of miles away, out of the “synchronous danger zone.” This approach provides near-zero RPO synchronously, while the third data center adds another layer of protection in case of a calamitous local disaster. It also allows more thorough testing of DR processes; however, it is extremely expensive and appropriate only for the most valuable applications.
At the April 10, 2012 Wikibon Peer Incite Research Meeting, we heard from an industry practitioner that installing a nearly indestructible, locally synchronous “black box” from a company called Axxana has allowed his organization to avoid the exorbitant costs of a three-site data center approach while enabling proper DR testing.
The nearly indestructible Axxana system captures the RPO delta (the data written since the last snapshot) from the IT shop's continuous data protection (CDP) solution, which takes snapshots at 15-minute intervals. The Axxana solution fills this DR gap and provides the protection of a three-site setup at about one third of the cost.
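The division of labour between the asynchronous replica and the local black box can be shown as a toy model. Every write is committed both to primary storage and to a hardened local journal. The remote copy is refreshed only at snapshot boundaries. After a disaster, the remote copy plus the journal entries written since the last shipped snapshot (the RPO delta) reconstruct the lost data. This is a minimal sketch under those assumptions. The class and method names are hypothetical, and the model says nothing about how Axxana or RecoverPoint are actually implemented.

```python
from typing import Dict, List, Tuple


class ToyReplicationSite:
    """Toy model: periodic async snapshots to a remote site plus a local
    hardened journal (the hypothetical "black box")."""

    def __init__(self) -> None:
        self.primary: Dict[str, str] = {}
        self.remote: Dict[str, str] = {}
        # Journal entries (sequence number, key, value) held in the black box.
        self.black_box: List[Tuple[int, str, str]] = []
        self.seq = 0

    def write(self, key: str, value: str) -> None:
        # A write counts as committed once both primary and black box hold it.
        self.seq += 1
        self.primary[key] = value
        self.black_box.append((self.seq, key, value))

    def ship_snapshot(self) -> None:
        # Periodic asynchronous replication (e.g. every 15 minutes).
        self.remote = dict(self.primary)
        # Everything journaled so far is now at the remote site; trim it.
        self.black_box.clear()

    def recover_after_disaster(self, black_box_survived: bool) -> Dict[str, str]:
        # The primary site is gone; start from the last remote snapshot.
        recovered = dict(self.remote)
        if black_box_survived:
            # Replay the RPO delta in write order.
            for _, key, value in sorted(self.black_box):
                recovered[key] = value
        return recovered


site = ToyReplicationSite()
site.write("order-1", "shipped")
site.ship_snapshot()              # remote now holds order-1
site.write("order-2", "paid")     # written after the last snapshot

print(site.recover_after_disaster(black_box_survived=False))  # order-2 lost
print(site.recover_after_disaster(black_box_survived=True))   # order-2 recovered
```

The remote snapshot interval sets how much data the journal must hold. It no longer sets how much data is lost, provided the black box survives the disaster.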
The solution offers the best of both worlds and is attractive, especially for mid-sized companies, because it delivers three-site-class protection using two data centers at asynchronous distances. Because the Axxana system is part of normal IT operations, the approach can allow practitioners to eliminate risky DR tests and instead make DR testing a routine part of IT operations.
The catch today is that the Axxana system supports only EMC RecoverPoint environments. Over time, as the product is proven, we expect it to target a much broader base of platforms. Nonetheless, the approach is intriguing and, when compared with alternative zero data loss solutions, appears to have great potential.
Action item: Proper DR testing has become a risky proposition for many companies, especially those that cannot afford three-site data center protection. IT organizations should consider new approaches, such as using nearly indestructible infrastructure locally (e.g., the Axxana black box idea) and leveraging asynchronous distances to cut costs, reduce testing risk, and improve disaster tolerance.