Highlights
A data center within P&H Mining Equipment (P&H) was struggling to recover data from a tape library that had become too small. Recovery of files and emails was taking up to 72 hours, and IT was increasingly concerned that it would not be possible to recover from a major disaster. Rather than increase the capacity of the tape library from 500 to 1,000 tapes, P&H chose an innovative solution: a ten-terabyte data de-duplication system from Data Domain. Importantly, the installation required no changes to the Tivoli Storage Manager (TSM) procedures and virtually no education for storage administrators. The system took about two weeks to install and commission.
The resulting solution was able to hold all the data previously managed by the tape library, and restores that once took 72 hours could be done in two hours. The data held on the Data Domain storage after de-duplication was 5.5 terabytes. Without de-duplication the system would have required 20 terabytes, a reduction ratio of 3.6 to 1. This was close to the ratio of 4 to 1 that Data Domain had predicted in the RFP process, and was a key credibility builder from the perspective of P&H.
P&H believes that the Data Domain system has performed to expectation. It has improved recovery times, productivity, and morale, and has created a more effective recovery procedure in the event of a major disaster. It has also helped to better position the shop for future requirements. The following factors summarize the P&H situation:
- The evaluation and selection process was performed effectively on the second RFP. The overall process was delayed by the first RFP, which senior management aborted because the capital required for the recommended solution came as a surprise and no budget was available;
- The main business driver was faster recovery enabled by moving from tape to a disk-based recovery system;
- Without data de-duplication, P&H would not have been able to cost justify a disk-based solution;
- The implementation and adoption of the technology was excellent as evidenced by P&H realizing Data Domain’s claims and expectations post implementation;
- The solution appears to be well-positioned to scale and grow over time.
Business Background
P&H is a global manufacturer of large excavating machines used to mine minerals and embedded materials. The business has about 2,500 employees, and is headquartered in Milwaukee, Wisconsin.
Original Storage Snapshot
The data center in Milwaukee supports 800 professional and engineering users. The information requiring protection is primarily user data held as files. The main types of data that need to be backed up are Lotus Notes databases, GroupWise data, Novell file server data and SQL flat text files. The real and virtual servers are backed up individually. While the recovery point objective (RPO) and recovery time objective (RTO) are not aggressive, the business is increasingly focused on recovering data faster (RTO). The RPO in a disaster was about 24 hours. The RTO for a disaster was three days, although IT believed that recovery would take significantly longer in reality. The RTO for lost data was one day, but actual performance was up to 72 hours (three days). Backups are managed using IBM’s Tivoli Storage Manager software (TSM). The “incremental forever” option is used for everything except the Notes backups, for which P&H does full backups every night. About 250GB of data each night is initially copied to the Tivoli server storage pool. Originally, data was moved to an LTO2 tape library the next day, and copies of the tapes were then moved off site for long-term storage to meet major disaster recovery objectives and compliance requirements. TSM was and remains the overall backup and recovery manager.
Storage Pain Points
The TSM backup process and backup window were not problems for the Milwaukee data center, and RPO was not an issue. The key pain point for users and IT was RTO, both the excessive time to restore files and email boxes and the risk of being unable to recover from a major disaster. As Nick Cannizzaro, supervisor of the infrastructure system architecture team at the P&H Milwaukee data center, said, “Backups are worthless until you need them; then they become priceless.” There were 500 tapes in the library. Because the library was full, tapes were constantly being moved from the library to local storage and to the off-site storage facility. Retention of the Lotus Notes files on the tape library had to be reduced from 90 days to 30 days to ease the tape constraint that had aggravated recovery times. TSM collocation (holding all the data for a server on one tape) had not been possible to implement, as P&H believes this would have increased the number of tapes to well over 750. Moreover, tapes were under-utilized and collocation would have made utilization worse. Storage administrators were spending a significant amount of time moving tapes in and out of the tape library and to and from the off-site storage facility, and restores were time consuming. There was also a risk of tape malfunctions, which added to the risk of not being able to recover from a major disaster.
Solution Strategy
P&H evaluated three primary alternative solutions to improve restore times, including:
- Increase the size of an existing Quantum tape library and implement TSM collocation
- Install a virtual tape library (VTL) system with or without de-duplication (at the time of the June 2008 RFP, the NetApp VTL did not have de-duplication, and the Diligent VTL did)
- Install a de-duplication system from Data Domain
P&H asked a value added reseller (Forsythe) to conduct the requirements and evaluation. Forsythe used its own evaluation process to understand the key requirements for P&H and helped draw up a list of vendors that were most likely to meet these requirements. The table below shows the key vendors evaluated in depth.
Increasing the size of the Quantum tape library to 750 tapes would have been the lowest-cost option, and would have enabled some improvement in recovery time because of collocation and the higher probability of tapes being in the tape library. However, there would not have been sufficient space to increase the retention period of the Lotus Notes files back to 90 days. In addition, implementing collocation would have meant that the tape library would have run out of space within six to twelve months, and the recovery problems would have returned. Storage administration would not have improved, and the system would not have been cheaper in the long run. Implementing TSM collocation would also have been a significant change to the TSM procedures. P&H believed that the approach would have improved tape management and data recovery in the short term, but would still not have given full confidence in recovering effectively from a major disaster.
The Data Domain solution met all the business requirements of P&H, and was the simplest and least risky to implement because no changes were required in backup processes or storage administration skills. Data Domain was also viewed as the most flexible option for meeting long-term requirements, and as the one that would save the most storage administration time.
Deep Dive: De-dupe Ratios & TSM
Data de-duplication ratios can vary significantly for different data types and update rates. Ratios as high as 20 to 1 can be achieved for data that has lower change rates and is backed up frequently. Data with higher change rates will exhibit lower ratios. Using Data Domain technology to de-duplicate disk storage, P&H is seeing an average de-dupe ratio of 3.6 to 1, shaving roughly 72% off comparable non-de-duped disk storage costs. The ratio of 3.6 to 1 is relatively low because of P&H's extensive use of TSM's incremental forever backup methodology, which only backs up changes. The advantages of this approach are that the amount of data backed up each night is smaller and the elapsed time for backup is significantly reduced. The disadvantage is that recovery is slower, as more backup records have to be accessed to complete a recovery. In this case, the main driver for P&H was finding a cost-effective way to move backup storage from tape to disk, allowing the TSM recovery process to be much faster. Despite what appears to be a relatively low de-dupe ratio, the advantages of faster recovery relative to tape overwhelmingly supported the business case for P&H, especially since backup procedures remained intact.
Note: Some applications at P&H which employ a full daily backup are achieving much higher compression ratios of 15 to 1.
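The arithmetic behind these ratios can be sketched in a few lines of Python. This is an illustrative calculation only, using the capacities quoted in this case study (20 terabytes before de-duplication, 5.5 terabytes after); the function and variable names are hypothetical and are not part of any Data Domain or TSM tooling.

```python
def dedupe_ratio(logical_tb: float, stored_tb: float) -> float:
    """Ratio of data before de-duplication to data actually stored."""
    return logical_tb / stored_tb


def space_saving(logical_tb: float, stored_tb: float) -> float:
    """Fraction of disk capacity avoided thanks to de-duplication."""
    return 1.0 - stored_tb / logical_tb


# Figures quoted in the case study.
LOGICAL_TB = 20.0  # capacity needed without de-duplication
STORED_TB = 5.5    # capacity used after de-duplication

ratio = dedupe_ratio(LOGICAL_TB, STORED_TB)
saving = space_saving(LOGICAL_TB, STORED_TB)

print(f"De-dupe ratio: {ratio:.1f} to 1")  # De-dupe ratio: 3.6 to 1
print(f"Space saving:  {saving:.1%}")      # Space saving:  72.5%
```

The same two functions show why a higher ratio brings diminishing returns: a ratio of 15 to 1 avoids about 93% of the capacity, only some 21 percentage points more than 3.6 to 1.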
Adoption Issues for Data Domain Solution
Implementing the system without having to change any procedures ensured that deployment and training time were minimal. With a VTL system, P&H assumed that about two months of training would be required before cutting over to the new system. The major issues with the Data Domain solution were setting up NFS on the AIX TSM backup server (all previous experience of the installation team had been with CIFS) and network card incompatibility problems. These were quickly resolved, and the total implementation and commissioning time was about two weeks, longer than expected but still quick. The system performed approximately as predicted by Data Domain in the RFP. The data held after de-duplication was 5.5 terabytes; without de-duplication this would have been 20 terabytes, an average compression ratio of 3.6 to 1. The compression ratio for Lotus Notes was much higher because a full backup is done each night. For the other data, the Tivoli incremental forever option was used so that only the changed data is backed up, lowering the compression ratio. Overall, the average compression ratio achieved was close to the ratio of 4.0 to 1 that Data Domain had predicted in the RFP process, and P&H was able to use the system to achieve all the business benefits planned. TSM remains the overall backup and recovery manager.
Benefits
The benefits of the new system were:
- Moving the recovery process from tape to disk significantly improved recovery times from 72 to 2 hours;
- The installation was implemented quickly without any changes to the existing backup processes, and with no retraining or additional manpower requirement;
- The new system can grow as data grows without change and without headcount increases;
- The recovery process for files and email was significantly improved and is now meeting service level agreement (SLA) expectations, and satisfaction with IT is significantly higher;
- IT could commit to the business that the site could recover its IT operations from a major disaster within a reasonable time and meet the RTO targets;
- Headcount was released from tape management to other IT projects;
- The solution was cost effective compared to the alternative products and approaches available to P&H.
A formal cost-benefit analysis is difficult without agreement on the business impact of data loss and system unavailability, and P&H had not conducted such an assessment.
Conclusions
P&H had not performed a formal analysis of its RPO and RTO objectives. An estimate of the business costs of lost data and of short-term and long-term system unavailability would have allowed a better underpinning of the decision process, and may have avoided the time lost in having to run a second RFP. Wikibon would recommend getting a formal RTO and RPO agreement with the lines of business. With this proviso, Wikibon believes that P&H together with Forsythe conducted a comprehensive and effective evaluation of the alternatives. The choice of the simplest implementation with the least change was a sound business strategy. The implementation took slightly longer than expected, but was still quicker than any alternative promised. The savings in storage administration time alone justified the solution.
The improvement in recovery time was a major benefit for the end-users. P&H tested an email recovery that previously took 72 hours and completed the process with the Data Domain system in about two hours. P&H believes that the process for recovery from a major disaster has been improved, and IT believes that a three-day RTO can now be achieved. Wikibon reserves judgment on this, as it was not clear how this improvement was tested.
Data Domain met all its pre-sales claims during implementation and adoption. The Data Domain system is currently 55% utilized, and P&H believes there is significant headroom in the solution for storage growth. Wikibon concludes that the adoption of Data Domain’s solution was cost effective for the P&H IT department, and improved the recovery capabilities significantly.
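As a rough illustration of that headroom, the sketch below estimates how much additional pre-de-duplication data the system might absorb. It assumes that the full ten terabytes is usable capacity and that the observed ratio (20 terabytes reduced to 5.5 terabytes) would also hold for new data. Neither assumption is confirmed by P&H or Data Domain, and the names are hypothetical.

```python
# Figures quoted in the case study.
CAPACITY_TB = 10.0  # system size (assumed here to be fully usable)
STORED_TB = 5.5     # data held after de-duplication
LOGICAL_TB = 20.0   # same data before de-duplication

# Assumption: the observed ratio also applies to future data.
ratio = LOGICAL_TB / STORED_TB

utilization = STORED_TB / CAPACITY_TB         # fraction of disk in use
free_physical_tb = CAPACITY_TB - STORED_TB    # raw disk still available
free_logical_tb = free_physical_tb * ratio    # pre-de-dupe data that would fit

print(f"Utilization:      {utilization:.0%}")
print(f"Free disk:        {free_physical_tb:.1f} TB")
print(f"Logical headroom: {free_logical_tb:.1f} TB before de-duplication")
```

Under these assumptions the 55% utilization figure follows directly from the quoted capacities, and the remaining 4.5 terabytes of disk would hold roughly 16 terabytes of additional backup data. A change in the mix of full and incremental backups would shift that estimate.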
Legal: © Wikibon 2009. This document is copyright protected by Wikibon and does not fall under the GNU general license terms for Wikibon.org. Links to this article from external sources are allowed, however any other re-distribution of this content for commercial purposes is strictly prohibited. Please contact Wikibon for more information.
Wikibon case studies are developed independently and their development is not initiated for or funded by any single company. Wikibon reports actual customer experiences and results with no attempt to emphasize any one vendor’s strengths or weaknesses.