Storage Peer Incite: Notes from Wikibon’s July 31, 2007 Research Meeting
Moderator: Peter Burris & Analyst: David Floyer
This week Wikibon presents Avoiding disaster recovery disasters. Too often businesses talk about disaster recovery (DR) but never really invest in it, or they put some practices in place but never enforce them or test them in a realistic scenario. And too often, when an organization undertakes a DR program, it is driven by the IT department, working with an insufficient budget and making basic decisions with only a limited understanding of business needs in a disaster. While IT has a responsibility to protect data, a DR program must be driven by business executives who understand its importance and who determine what level of investment is appropriate for their company and what level of risk they are willing to accept. Operating departments and lines of business must also be involved in a realistic assessment of the risks and of the appropriate levels of protection, as well as in how best to achieve that protection. IT must take a subsidiary role, contributing information on the requirements and expenses of various technical DR approaches and carrying out the plans approved by business leadership. The ultimate tests of that leadership's commitment are, first, the level of funding it approves for DR and, second, the willingness of senior corporate and line-of-business management to participate in DR exercises.
Bert Latamore
Avoiding disaster recovery disasters
In this post-9/11, post-Katrina era, businesses have at least begun discussing the degree to which their IT infrastructure in general and storage infrastructure in particular is ready to respond to a potential disaster. The Wikibon community observes that businesses frequently talk about the need for comprehensive disaster recovery but usually fail to fund that effort appropriately. More importantly, they often fail to fully engage in the planning process to ensure that the processes and programs for returning the business to acceptable operational mode are fully adopted by everyone concerned.
For IT organizations put in the position of having to initiate a disaster recovery planning process, we suggest four areas of focus:
- Develop a clear assessment of business impact,
- Gain full agreement among all stakeholders on the metrics for measuring, assessing, and communicating the business impact of a potential disaster,
- Develop all the skills needed to carry out the activities involved in preparing for and recovering from a disaster,
- Gain agreement on the level of investment in DR and on the returns that investment is expected to deliver.
On each of these points, IT organizations must accept a secondary role and push business leaders to take responsibility for moving the planning and execution processes forward.
The business impact activities are critical. Business leaders must factor in both the dollar impact of a disaster and the likelihood that the disaster will occur. To accomplish this, at least three groups must be fully invested in the planning process:
- Line-of-business leaders, focused on the revenue of their business operations along many dimensions, including time,
- IT organizations, which will need to quantify and categorize the technology infrastructure risks, and
- Experts in facilities, who must provide clear guidance regarding the likelihood that an external event might impact the locations of IT and/or other business assets.
Other groups – compliance, the CXOs, and perhaps even the board of directors who will have to sign off on commitments regarding data security, quality and availability – may also be involved.
An increasingly popular technique for assessing business impact against multiple dimensions is triangulation. The business will triangulate the likely costs of a disaster as a function of volume of lost business and risk using internal estimates, assessments by third parties such as insurance companies, and other external resources such as its investment bank’s assessment of the likely impact of a disaster on the company’s capitalization.
The metrics used to optimize those decisions include Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Neither metric is sufficient on its own, but both are necessary to drive the planning process.
Finally, we note that allocating the dollars required to achieve the optimum level of disaster recovery preparedness and response is an ultimate test of the company’s commitment to forging high-quality disaster recovery strategies. These dollars include out-of-pocket expenses for products and redundant resources, the willingness of business leaders to take time to practice appropriate DR techniques and approaches, and ultimately the degree to which full testing of different DR plans is allowed.
IT groups in organizations that talk about disaster recovery preparedness and optimization but do not fund it should not be seduced by the opportunity to buy more technology or experiment with new products. Instead, they should press the business to lead the process as aggressively as possible. Organizations that need the highest degree of preparedness, particularly in the financial industry, are favoring three-node data center architectures as a part of the solution: two centers geographically close together, linked by synchronous data connections, and backed up by a third, remote site (e.g., across the continent or the Atlantic) with asynchronous data backup.
Action item: IT organizations should not attempt to lead planning efforts for disaster recovery but instead must not only accept but demand business leadership in this critically important domain. IT organizations should, however, immediately begin investigating the three-node data center architecture as a technical approach to provide the IT architecture that may be required for DR and establish strong relationships with facilities and other internal groups that will ultimately have to assess the probability and consequences of having to respond to a major disruption in the business.
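As a rough illustration of the three-node architecture described above, the topology can be modeled as a set of replication links, each either synchronous or asynchronous. This is only a sketch: the site names, the `Link` type, and the `recovery_candidates` helper are hypothetical, and the model assumes both replicas are fed from a single primary site.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SYNC = "synchronous"    # no acknowledged write is lost if the source fails
    ASYNC = "asynchronous"  # the replica may lag the source by some interval


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    mode: Mode


# Hypothetical three-node layout: a nearby site kept in lockstep,
# plus a distant site that receives updates asynchronously.
THREE_NODE = [
    Link("primary", "nearby", Mode.SYNC),
    Link("primary", "remote", Mode.ASYNC),
]


def recovery_candidates(links, failed_sites):
    """Return surviving replica sites, synchronous (zero-loss) copies first."""
    survivors = [link for link in links if link.target not in failed_sites]
    # sorted() is stable and False sorts before True, so SYNC links come first.
    ordered = sorted(survivors, key=lambda link: link.mode is not Mode.SYNC)
    return [link.target for link in ordered]


# Local failure: the nearby synchronous copy is the preferred target.
print(recovery_candidates(THREE_NODE, {"primary"}))            # ['nearby', 'remote']
# Regional disaster takes out both close sites: only the remote copy survives.
print(recovery_candidates(THREE_NODE, {"primary", "nearby"}))  # ['remote']
```

The ordering reflects the trade-off in the text: the nearby site protects against data loss, while the remote site protects against events large enough to affect both nearby centers.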
Triangulating Disaster/Recovery Cost and Risk
Optimizing disaster/recovery (DR) investments requires a consensus on two metrics up front: (1) the business cost of a disaster's impact; and (2) the risk that the disaster will occur. From this information, an estimate of the likely scope and scale of disaster-related losses can be created.
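The way the two metrics combine can be made concrete with a simple expected-loss calculation. The sketch below is illustrative only: it assumes each disaster scenario can be summarized by a single cost figure and a single annual probability, and the scenario names and numbers are hypothetical rather than taken from the meeting.

```python
from dataclasses import dataclass


@dataclass
class DisasterScenario:
    name: str
    business_cost: float       # estimated cost if the disaster occurs, in dollars
    annual_probability: float  # estimated chance of occurrence in one year (0 to 1)

    def expected_annual_loss(self) -> float:
        # Metric (1) multiplied by metric (2).
        return self.business_cost * self.annual_probability


def total_expected_loss(scenarios) -> float:
    """Sum the expected annual loss across all scenarios."""
    return sum(s.expected_annual_loss() for s in scenarios)


# Hypothetical inputs, for illustration only.
scenarios = [
    DisasterScenario("regional flood", 40_000_000, 0.01),
    DisasterScenario("data center fire", 10_000_000, 0.02),
]

for s in scenarios:
    print(f"{s.name}: {s.expected_annual_loss():,.0f}")
print(f"total: {total_expected_loss(scenarios):,.0f}")
```

A total of this kind gives a first-order ceiling on what it is rational to spend each year on DR for the scenarios considered, although in practice each input is a contested estimate rather than a known quantity.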
Calculating each of these two metrics can be relatively easy. For example, business costs include lost revenue, destruction of assets, reduction of brand luster, etc. The challenge, however, is to reach the degree of consensus across multiple different operating groups and expertise centers required to forge a successful investment strategy.
Triangulating different estimates from multiple, credible sources is often an essential step in the process of building a consensus DR business case. Four different sources for the triangulation process are gaining favor among professionals responsible for building DR investment strategies. First, DR planners almost always aggregate internal, "bottom-up" estimates of the scale and scope of disaster impacts. Second, DR planners occasionally solicit quotes from insurers for covering losses from the disaster, using these policy costs as a reasonable approximation of likely disaster costs. Third, government regulations in many industries (e.g., financial services) use standard methods to calculate the reserve levels required to cover the impacts of a disaster. Finally, some creative firms are approaching investment bankers to render a valuation opinion in terms of a disaster (i.e., if this disaster should occur, what would be the effect on our stock price); this approach often draws significant executive attention, given that most executive compensation plans are closely tied to stock prices.
Each of these approaches forms a reasonable baseline for building a DR business case. Taken together, they will facilitate a quality consensus in an otherwise politically charged decision making environment.
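One simple way to put the four sources side by side is to record each estimate and look at the range and the midpoint. The following is a minimal sketch under that assumption: the source labels mirror the four listed above, while the dollar figures and the `triangulate` helper are hypothetical.

```python
from statistics import median


def triangulate(estimates):
    """Summarize independent loss estimates (dollars) keyed by source name."""
    if not estimates:
        raise ValueError("at least one estimate is required")
    values = sorted(estimates.values())
    low, high = values[0], values[-1]
    if low <= 0:
        raise ValueError("estimates must be positive")
    return {
        "low": low,
        "median": median(values),
        "high": high,
        # A large ratio signals that the sources disagree and that
        # consensus-building work remains.
        "spread_ratio": high / low,
    }


# Hypothetical estimates of the cost of one disaster scenario.
estimates = {
    "internal bottom-up": 30_000_000,
    "insurer policy cost": 45_000_000,
    "regulatory reserve method": 40_000_000,
    "investment bank valuation": 60_000_000,
}

summary = triangulate(estimates)
print(summary)
```

The median is used here rather than the mean so that one outlying source does not dominate the figure; that choice is a design assumption, not something prescribed in the meeting.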
Action Item: Garnering cost and risk assessments from as many credible sources as possible produces better estimates and strengthens DR investment consensus. To optimize DR investment decisions, organizations must factor in cost and risk estimates from professionals (who sit atop their own expertise networks) in the business of pricing risk.
What is IT's role in justifying storage disaster recovery spend?
As suggested in 'Avoiding disaster recovery disasters,' numerous groups must participate in the triangulation process of assessing the business impact of a disaster, justifying expenditures, and ultimately implementing a solution. These include:
- Line of business (LoB) heads to assess revenue impact,
- The CEO/CFO/Board to guarantee the initiative is of a high enough priority to receive funding,
- LoB application owners to help understand the business process effects,
- Risk management to provide depth to loss mitigation strategies,
- Corporate audit to ensure standards are put in place and met,
- Chief security officers to understand and mitigate exposures to the extent possible,
- Facilities and logistics experts to establish contingency plans and scenarios,
- And of course, IT.
As the saying goes, 'there is no such thing as an IT project' and this is especially true for disaster recovery. IT's vital role is to facilitate the initiative by building awareness, educating constituencies and driving common definitions and coordination activities. Indeed, communicating the nature of the problem is in and of itself critical as it will be easy for executives to mistakenly buy in to industry hype about 'systems that never go down.'
Action Item: IT must clearly articulate its role in justifying and optimizing disaster recovery investments and limit its scope of responsibility to communicating the need, establishing common definitions, coordinating activities and educating key constituencies about technology tradeoffs and associated risks.
Focus on RPO
Traditional backup systems provide a point-in-time backup. The amount of data lost is up to a day's worth; the average Recovery Point Objective (RPO) is more than 20 hours. In the past, however, when businesses ran on a combination of paper and electronic systems, it was possible to throw labor at the problem and recover most of the data.
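The arithmetic behind point-in-time data loss can be sketched as follows. The model is an assumption, not something stated in the meeting: backups are taken at a fixed interval, a backup only protects data once it has been completed and moved to a safe location (the `protection_delay_h` parameter is hypothetical), and a disaster is equally likely at any moment.

```python
def data_loss_window_hours(backup_interval_h, protection_delay_h=0.0):
    """Return (average, worst_case) hours of data lost under the stated model.

    At the moment of a disaster, the newest usable backup is between
    protection_delay_h and protection_delay_h + backup_interval_h old.
    """
    if backup_interval_h <= 0 or protection_delay_h < 0:
        raise ValueError("interval must be positive and delay non-negative")
    average = backup_interval_h / 2 + protection_delay_h
    worst_case = backup_interval_h + protection_delay_h
    return average, worst_case


# Daily backup that is usable immediately.
print(data_loss_window_hours(24))      # (12.0, 24.0)
# Daily backup that takes a hypothetical 8 hours to complete and ship offsite.
print(data_loss_window_hours(24, 8))   # (20.0, 32.0)
```

Under this model, any delay in getting the backup to safety adds directly to the data lost, so the effective recovery point of a daily backup can sit well above half the backup interval.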
In today’s integrated systems environment, with transactions generated from inside and outside the organization, any thought that data could be recovered other than automatically is an illusion. Data inconsistency means manual intervention. Loss and inconsistency of data create extreme uncertainty in customers, suppliers and the business.
The ability to recover automatically to a consistent point with the minimum of data loss is becoming a business imperative for most intermediate and large organizations. Financial organizations are leading the way in installing systems that provide close-to-zero data loss recovery with three-node topologies. As the cost of these technologies comes down over time, other industries will adopt this approach, and applications and infrastructure will be architected for close-to-zero data loss.
Action item: IT executives should aggressively educate their constituents that improving RPO and avoiding data loss/data inconsistency will in general be a better investment than focusing on RTO and getting systems up faster.
Selling storage disaster recovery: Don't be column fodder
With rapid advances in storage technology, including data de-duplication, C-site solutions, virtual tape libraries (VTLs) and the like, storage suppliers have plenty to discuss with IT buyers and storage administrators. However, when it comes to selling disaster recovery, vendors should not stop at the IT department. More than most projects, disaster recovery initiatives are complex and involve a multi-phased justification process. Storage companies need to demonstrate an understanding of key constituencies and their problems and priorities, and be able to articulate a vision of how their company and its solutions will address those priorities over the longer term.
The challenge for storage sellers is that their primary contacts are in the IT department. The conundrum is that these advocates are going to be most sensitive to budget and the enormous costs of the initiative and less focused on what really matters (e.g., the business impact analysis and the process of building consensus). Storage companies must find ways to participate in the assessment process across the organization, at the least to collect credible data points for proposal development, and ideally to affect the outcome of the deal.
Action Item: Selling to just IT will get storage companies in on the RFP but it won’t win the deal. Storage sellers must understand how the DR decision will be made and become a resource to the decision makers by assisting with the process of triangulation. This will require forging joint-marketing partnerships with insurance companies, solidifying third-party service relationships and establishing other critical path alliances to elevate their importance in the eyes of key constituencies.
Eliminating disaster recovery testing
In the case study presented in the July 31, 2007 Peer Incite Meeting, IT wanted to radically change the philosophy of remote recovery, and build resilience into both the applications and infrastructure. Rather than testing remote disaster recovery as a special case a few times a year, an expensive process which introduces risk of data loss in and of itself, they wanted to be able to switch applications to any node, local or remote, as a normal part of operations.
The key to realizing this dream is to have fail-over and fail-back mechanisms working with zero data loss at both the production sites and disaster recovery sites.
Action item: For successful disaster recovery, procedures should be part of normal operations and not require 'unusual gymnastics.' While the circumstances of a disaster are most certainly unusual and unpredictable, simplifying the processes around recovery is critical to success.