In April 2012, Wikibon held a Peer Incite on Zero Data Loss using EMC RecoverPoint together with Axxana’s “Black Box” technology, which can retain and allow recovery of transient data through disasters such as fires, floods, tornadoes and earthquakes. Tim Hays, VP of IT at Animal Health International, a distributor of food and animal health products, talked about his experience implementing the EMC/Axxana solution and providing a disaster recovery solution that reflected the company's recovery priorities at a cost the business could afford.
In 2009, Wikibon published “The economic value of Axxana's zero data loss solution”. This professional alert is an update to that piece, using updated assumptions and pricing.
The key conclusion that the case study of Animal Health International illustrates is that the EMC/Axxana solution makes zero data loss techniques cost effective for mid-sized companies that are outside the financial or banking industry.
In addition, it shows that in today’s IT environment application systems are increasingly closely intertwined; a zero data loss solution for all active data leads to much simpler and more efficient business recovery processes and gives better business protection than solutions that focus on zero data loss and faster recovery for a subset of applications.
The economic model Wikibon developed uses off-site tape backup as the baseline to compare the costs and value of alternative disaster recovery solutions, including the EMC RecoverPoint and Axxana two-node zero data loss solution at any distance. Tape recovery was used as a baseline because it represents a lowest common denominator for comparison. Virtually all organizations can relate to, conceptualize and roughly quantify the costs and business process impacts associated with a tape-based recovery solution. As such, this technique allows Wikibon to make mathematically consistent and defensible comparisons across any use case scenario.
Investment in disaster recovery (DR) is essentially insurance against unlikely events. To evaluate insurance it is useful to look at the expected loss (see footnote 1) of different alternatives and compare the costs associated with these alternatives. The expected loss comes from:
- Loss of data (e.g., orders, transfers etc. that are in process); the amount of data that will be lost if a disaster occurs and the time taken to restore the business service.
- Loss of IT service to personnel, customers, and partners.
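The expected-loss idea can be sketched in a few lines of Python. This is a minimal illustration, assuming each disaster type has an annual probability and a single dollar impact; the `DisasterEvent` class, the event names and all figures are hypothetical and are not taken from the Wikibon model.

```python
from dataclasses import dataclass


@dataclass
class DisasterEvent:
    """One hypothetical disaster type (names and numbers are illustrative)."""
    name: str
    annual_probability: float  # chance the event occurs within a year
    impact_dollars: float      # financial impact of lost data and lost service


def expected_annual_loss(events):
    """Sum of probability x impact over all events."""
    return sum(e.annual_probability * e.impact_dollars for e in events)


# Illustrative inputs only; they are not taken from the Wikibon model.
events = [
    DisasterEvent("fire", 0.01, 5_000_000),
    DisasterEvent("flood", 0.005, 8_000_000),
]

print(f"Expected annual loss: ${expected_annual_loss(events):,.0f}")
```

Comparing two DR alternatives then amounts to running the same sum with the impact figures each alternative would produce and setting the difference against the difference in cost.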
The economic value model we developed captures several information sources necessary to evaluate the costs and benefits associated with virtually any disaster recovery (DR) solution. They include information necessary to determine both cost and expected loss in real value terms, as well as drive key assumptions relative to various DR scenarios which are described below.
Base Information Variables driving the Wikibon Model
The basic variables include:
- The revenue contribution of the applications supported by the DR infrastructure;
- The storage capacity being protected;
- The distance between the primary and backup sites (as the line lurches);
- I/O rate for applications being protected;
- The business impact of an outage;
- The probability of an outage;
- Line costs for the specific locations;
- Cost of primary storage;
- Data necessary to calculate NPV and other financials.
DR Alternatives Assessed by Model
The model addresses a spectrum of five (5) main scenarios:
- Base Case of Offsite Tape Only:
- This is a well-understood, low-cost and time-honored method of managing disaster recovery. A coherent set of data is written to tape and transported off-site. The amount of data that would be lost in a disaster (Recovery Point Objective, RPO) is determined by how often the system is backed up and how quickly it is transported off-site. If a system is backed up daily, and it takes six hours to move the data off-site, then the average amount of data lost will be that created in an 18-hour production window (half of the 24-hour backup interval plus the six hours of transport). There are new technologies for improving the time to transfer the data (snapshots, de-duplication, and transmission over the wire) but the fundamentals have not changed significantly.
- The Recovery Time Objective (RTO) to bring up the system again on another system, reconcile the data lost and bring it back into full production is usually measured in days.
- Two Data Center Synchronous:
- Creating a second copy of the data in a second site at the same time as the original data is created means that if the primary site is lost, the second site can recover without any data being lost. This RPO-zero solution is well understood and used extensively in the financial sector. Usually the second data center is fully equipped with servers, and because no data is lost, recovery time (RTO) is usually quick and can be a matter of hours. However, the distance that the two data centers can be apart is limited to less than 40 miles for most applications (see Asynchronous below). There is a significant chance that both sites will be taken out by a rolling disaster (disaster being defined in a broad sense that could include, for example, local unrest or union action). Recovery would then have to be made from a normal tape backup. Financial watchdogs such as the SEC strongly recommend distances of 200 miles or more between data centers to eliminate the risk of a rolling disaster, and this makes Two Data Center Synchronous alone not viable for large and/or non-local organizations. The Two Data Center Asynchronous or Three Data Center solutions (see below) are the normal alternatives.
- Two Data Center Asynchronous:
- The ideal location for a second data center is usually hundreds of miles away from the primary site. As discussed above, the SEC and other regulatory bodies strongly recommend that a recovery solution include a data center at least 200 miles from the primary site. However, for normal applications it is not possible to keep the data exactly consistent in both locations (if you wait for an acknowledgment that the data is safe at the second site, the delays in the transmission, even at near the speed of light, mean that wait time for I/O becomes unacceptably high, and system throughput slows to a crawl). An asynchronous DR solution keeps a small buffer of information at the primary end and ensures that a coherent set of data is transmitted to the other end. The RPO is much better than a tape backup solution, but the fact that some data is lost means that the RTO is longer than synchronous solutions because the databases need to be reconciled with other business records.
- Three Data Center DR solutions:
- Three data center solutions are a hybrid between synchronous and asynchronous solutions. A two data center synchronous solution is set up between the primary and the “B” site less than 20 miles away, and a second asynchronous connection is set up with the remote site (“C” site). Either the B site is connected to the C site or the primary A site is connected to the C site or both. This cascaded or multi-hop approach ensures that most of the time failovers can occur to the B site without data loss, and in the case that both the primary A and B sites are taken out, the C site can recover with less data loss and much more quickly than a tape recovery approach. However, the transmission line, data center infrastructure and storage costs of such solutions are very high, and this solution, therefore, is only used by a relatively small number of organizations (mostly financial).
- Two Data Center EMC RecoverPoint and Axxana Solution:
- The EMC RecoverPoint and Axxana solution is logically the same as a three data center solution. The difference is that a synchronous copy of the data that has not been sent to the remote site is held on the primary site. It is protected from a disaster not by distance, but by Axxana’s “black-box” technology that provides physical protection from fire, water or earthquake. In the event of a disaster, the data in the black-box can be recovered by Internet or cellular transmission and transmitted to the remote location. This enables a zero-data loss solution with two data centers at extended distances.
- RecoverPoint is an appliance with 3 types of splitters. These are host based, fabric based, and array based. RecoverPoint works with EMC and non-EMC array products except for the array-based splitter option.
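Two figures in the scenarios above lend themselves to a quick calculation: the 18-hour average data loss for daily tape backup, and the transmission delay that limits synchronous distances. The following is a minimal Python sketch, assuming a disaster strikes at a uniformly random time and that light travels through optical fibre at roughly 200,000 km/s. The function names and the fibre speed are illustrative assumptions, not part of the Wikibon model.

```python
def average_tape_rpo_hours(backup_interval_hours, transport_hours):
    """Average age of the newest off-site copy when disaster strikes.

    On average a disaster lands halfway through the backup interval, and the
    most recent backup only counts once it has reached the off-site location.
    """
    return backup_interval_hours / 2 + transport_hours


# Assumption: light travels through optical fibre at roughly 200,000 km/s,
# which is 200 km per millisecond.
FIBRE_KM_PER_MS = 200.0
KM_PER_MILE = 1.609344


def min_round_trip_ms(distance_miles):
    """Lower bound on the delay added to each synchronous write.

    Ignores switching, protocol overhead and indirect cable routes, so real
    delays are higher.
    """
    return 2 * distance_miles * KM_PER_MILE / FIBRE_KM_PER_MS


print(average_tape_rpo_hours(24, 6))  # 18.0
for miles in (20, 40, 200):
    print(miles, "miles:", round(min_round_trip_ms(miles), 2), "ms")
```

The round-trip figure is only a floor: a synchronous write cannot complete faster than this, and every additional protocol exchange per write multiplies it, which is why synchronous replication is kept to short distances.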
Our conclusions focus on two main areas:
- What is the impact of EMC RecoverPoint and Axxana on benefits?
- What does the model say about the impact of EMC RecoverPoint and Axxana on costs?
At the highest level, the EMC RecoverPoint and Axxana technology brings the probability of losing data very close to zero at asynchronous distances. It provides the same level of business protection from data loss as a three data center solution and a higher level of protection than either synchronous or asynchronous solutions. Because no data is lost, RTO should also be better than with asynchronous solutions. RTO will be slower than with synchronous solutions if the second site is unaffected, but much faster in the case that both synchronous sites are affected by a disaster. Conceptually, compared to alternatives, the EMC RecoverPoint and Axxana approach simplifies implementation and testing of near zero data loss solutions.
The main impacts of an EMC RecoverPoint and Axxana solution on cost are:
- It reduces line costs by decreasing the peak threshold required for a desired service level;
- It reduces the cost of storage because less redundancy is needed to meet the same recovery objectives;
- It simplifies the set up and operating environment;
- It allows much easier testing of DR function.
In theory, the EMC RecoverPoint and Axxana approach will allow organizations to eliminate or avoid building an entire data center (e.g. the B site) in a three data center solution. However, the solution must be proven in the market before this strategy is widely adopted.
The EMC RecoverPoint and Axxana solution is not appropriate for very small systems (e.g. below about 20TB), where it costs more than a simple replicated solution and its cost would therefore exceed the benefits.
To illustrate the economics of EMC RecoverPoint and Axxana's solution, we have run the following case example through the model. The customer profile (similar to, but not the same as, Animal Health International) is:
- A mid-sized organization with revenues of $500 million;
- Core business applications and Microsoft support applications;
- Two locations about 80 miles apart;
- A tape-based backup and recovery system.
The impact of EMC RecoverPoint and Axxana's solution on the benefit side is notable and essentially identical to non-tape alternatives. Specifically, Figure 1 shows these benefits relative to alternative DR approaches. The primary benefit calculated is the reduction in expected loss (i.e. the lower probability of losing data) as a result of putting in place a disk-based recovery solution (synchronous, asynchronous or three data center). In each scenario, the benefits of the target DR solution are based on a comparison to tape-based recovery. As such, relative to tape-based recovery, all solutions show substantial benefits from a reduction in expected loss.
This result is due primarily to the following points:
- The RTO of all alternative scenarios is dramatically improved over tape's 96 hours;
- The RPO in all alternative cases is dramatically improved relative to tape's 18 hours of data loss;
- The expected loss of the asynchronous solution is greater than the solutions with a synchronous component;
- The simplification of business recovery processes from zero data loss did not accrue to the asynchronous solution.
Overall, all disk-based solutions were significantly better than the current tape-based solution, and EMC RecoverPoint and Axxana demonstrates benefits that are equal to or greater than alternatives.
As seen in Figure 2, EMC RecoverPoint and Axxana's solution has a lower cost-of-ownership than alternative disk-based DR solutions. Our analysis for this specific example shows the following:
- Costs for EMC RecoverPoint and Axxana's solution are approximately $0.8M lower than those required to run asynchronous or synchronous data protection;
- Costs for EMC RecoverPoint and Axxana are nearly $8M lower than those required to run a 3-node data center solution.
In our assessment, EMC RecoverPoint and Axxana will have the lowest cost of staff, because the solution is simpler to install, test, and manage. For example, an EMC RecoverPoint and Axxana approach reduces the amount of equipment needed to be managed. In a two data center solution, exact copies of servers are required in two sites and in a three data center approach, three sets of servers are needed. In addition, the cost of lines is lower for EMC RecoverPoint and Axxana over asynchronous distances because of a reduced peak bandwidth requirement. Two data center solutions require dark fibre over shorter distances, increasing costs.
EMC RecoverPoint software was assumed for the EMC/Axxana solution and EMC SRDF or HDS TrueCopy was assumed for alternatives, meaning fewer copies of data were required for EMC RecoverPoint and Axxana. The footnotes provide additional detail about the inputs and assumptions used for the model.
Summary and Conclusions from Model
The question the model attempts to address is: Relative to advanced disk-based DR solutions, how does EMC RecoverPoint and Axxana fare? From the case study above and other analysis using the model, the following key points are highlighted:
- The EMC RecoverPoint and Axxana approach decreases the expected loss relative to asynchronous solutions;
- The EMC RecoverPoint and Axxana solution provides risk reduction substantially similar to both synchronous and 3-node data center approaches at a much lower cost;
- As a result, despite the higher costs for disk-based DR solutions, for environments with high data value the ROI of all these solutions is evident.
- The incremental CAPEX and OPEX of EMC RecoverPoint and Axxana's solution are much lower than alternatives.
Figure 3 shows the 3-year net present value of the EMC RecoverPoint and Axxana solution as about $9.1M higher than an asynchronous solution, and over $20M higher than a three data center solution.
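For readers who want to reproduce this kind of comparison with their own figures, a net present value calculation can be sketched as follows. This is a minimal Python illustration, assuming benefits arrive at the end of each year and are discounted at a single rate; the function name, cash flows and discount rate are hypothetical and do not come from the Wikibon model.

```python
def net_present_value(initial_cost, annual_net_benefits, discount_rate):
    """Discount each year's net benefit back to today and subtract the
    up-front investment."""
    discounted = sum(
        benefit / (1 + discount_rate) ** year
        for year, benefit in enumerate(annual_net_benefits, start=1)
    )
    return discounted - initial_cost


# Hypothetical inputs; none come from the Wikibon model.
npv = net_present_value(
    initial_cost=1_000_000,
    annual_net_benefits=[600_000, 600_000, 600_000],
    discount_rate=0.05,
)
print(f"3-year NPV: ${npv:,.0f}")
```

Here the annual net benefit would be the reduction in expected loss relative to tape, less the yearly operating cost of the DR alternative being assessed.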
The bottom line is that EMC RecoverPoint and Axxana's approach appears to substantially cut the cost of achieving near-zero data loss and can do so at asynchronous distances, dramatically decreasing infrastructure costs relative to 3-node data centers and reducing the proximity risk commonly seen in synchronous operations.
Wikibon will be happy to run the model for Wikibon clients.
Wikibon analysts have extensive experience in assessing the economic value of disaster recovery solutions. Our experts have studied this issue for more than a decade and have constructed dozens of models to support large financial institutions and a variety of cross-industry organizations. We have done so in both mainframe and non-mainframe environments and studied virtually every vendor's solution in this space.
Axxana is a startup and must prove to us and the world that it can execute on its vision of providing high quality disaster recovery solutions at substantially reduced operational costs. Axxana faces several hurdles in this regard, including product stability, channel uptake, the ability to evolve its product, and customers' willingness to fit the solution into their business processes or potentially alter processes to fit the solution.
Nonetheless, on balance we are impressed with the Axxana management team. We feel they are capable of securing the continued funding necessary to execute and have the wherewithal to deliver on the company's vision.
Action Item: For most businesses, the potential loss of brand image and customer/partner trust from lost data is significantly greater than the benefit of slightly improved times to start doing business again. Technology has very significantly reduced the cost of zero data loss solutions, which can be justified by simpler and more efficient business processes. CEOs and CIOs should focus on creating a long-term strategy for providing a zero data loss solution for all active data.
Footnotes: 1 Expected loss is the sum, over all disastrous events that can occur, of each event's financial impact multiplied by the probability of that event occurring within a given time interval. The model calculates the expected annual loss from n events that lead to loss of data and loss of IT services; each event has a financial impact ($Impact) and a probability (p) of occurring within a year. Insurance companies are a good source of information about the probability of different events.
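Written out from the description in footnote 1, and assuming each event i has its own probability p_i and its own impact, the expected annual loss would take the following form. This is a reconstruction from the wording of the footnote, and the subscript notation is an assumption.

```latex
\[
  \text{Expected annual loss} \;=\; \sum_{i=1}^{n} p_i \times \$\text{Impact}_i
\]
```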