Originating Author: Robert Levine
The recovery point objective (RPO) is the maximum acceptable level of data loss following an unplanned “event”, such as a disaster (natural or man-made), an act of crime or terrorism, or any other business or technical disruption that could cause such data loss. The RPO represents the point in time, prior to such an event or incident, to which lost data can be recovered (given the most recent backup copy of the data). The recovery time objective (RTO) is the period of time within which business and / or technology capabilities must be restored following an unplanned event or disaster. The RTO is a function of the extent to which the interruption disrupts normal operations and the amount of revenue lost per unit of time as a result of the disaster. These factors in turn depend on the affected equipment and application(s). Both of these numbers are key targets that are set by the business during business continuity and disaster recovery planning; these targets in turn drive the technology and implementation choices for business resumption services, backup / recovery / archival services, and recovery facilities and procedures.
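As a rough illustration of how an RPO relates to a backup schedule, the sketch below assumes that each backup captures the data as of its start time and becomes usable only once it has finished. Under that assumption, the worst-case data loss is the backup interval plus the backup duration. The function names and sample figures are hypothetical and only serve the illustration.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst-case age of the newest usable backup at the moment of failure.

    Assumption: a backup reflects the data as of its start time and can
    only be restored from once it has completed.
    """
    return backup_interval + backup_duration


def schedule_meets_rpo(rpo: timedelta,
                       backup_interval: timedelta,
                       backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


# A nightly backup that takes two hours cannot guarantee a 24-hour RPO.
nightly_ok = schedule_meets_rpo(
    rpo=timedelta(hours=24),
    backup_interval=timedelta(hours=24),
    backup_duration=timedelta(hours=2),
)

# An hourly backup that takes five minutes comfortably meets a 4-hour RPO.
hourly_ok = schedule_meets_rpo(
    rpo=timedelta(hours=4),
    backup_interval=timedelta(hours=1),
    backup_duration=timedelta(minutes=5),
)
```

Here `nightly_ok` is `False` and `hourly_ok` is `True`. Real schedules also need to account for failed or skipped backups, which this sketch ignores.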
Many organizations put the cart before the horse by selecting and deploying technologies before understanding the business needs as expressed in RPO and RTO; IT departments later bear the brunt of user complaints that service expectations are not being met. Defining the RPO and RTO first avoids that pitfall, and can also make for a compelling business case for recovery technology spending and staffing.
Recovery point objective / recovery time objective capability
Numerous studies have attempted to determine the cost and other effects of downtime for various applications, capabilities, or other activities in the organization. These studies have concluded that the cost varies depending upon longer-term factors as well as short-term effects, and that these costs must be taken into account when choosing a business recovery strategy. This is where the RPO and RTO come in. Once the RTO for an application has been defined, business continuity planners can decide which disaster recovery technologies are best suited to the situation. For example, if the RTO for a given application is one hour, redundant data backup on external hard drives may be the best solution. If the RTO is five days, then tape, compact disk, or offsite disk storage may be more cost-effective. However, a near-zero RTO (for a time-sensitive operation like a dealing room in a bank) may require newer “continuous backup” technologies.
Likewise, the definition of the RPO, or point in time to which work must be restored following a disruptive event, drives the technology and implementation choice. Many businesses will make do with a “close of business” RPO, which means that they are happy restoring data, systems, and activities to the prior day’s close of business in the event of a disaster or incident. Others (remember the dealing room) will have a “point of failure” RPO; in other words, the data, systems, and activities must be recovered to the point at which they failed. And of course, there are many other RPOs and RTOs that can be defined.
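The way an RTO narrows the technology choice can be sketched as a simple lookup. The three options and their recovery times are taken from the examples above; treating them as "typical recovery times", modelling near-zero as zero, and assuming that slower options are cheaper are all simplifying assumptions made for this illustration. The names `Option`, `OPTIONS`, and `suggest_option` are hypothetical.

```python
from datetime import timedelta
from typing import NamedTuple


class Option(NamedTuple):
    name: str
    typical_recovery: timedelta  # assumed time needed to restore service


# Ordered fastest first; slower options are assumed to be cheaper.
OPTIONS = [
    Option("continuous backup", timedelta(0)),  # "near-zero" modelled as 0
    Option("redundant backup on external hard drives", timedelta(hours=1)),
    Option("tape, compact disk, or offsite disk storage", timedelta(days=5)),
]


def suggest_option(rto: timedelta) -> Option:
    """Pick the slowest (assumed cheapest) option that still meets the RTO."""
    if rto < timedelta(0):
        raise ValueError("RTO cannot be negative")
    eligible = [o for o in OPTIONS if o.typical_recovery <= rto]
    return max(eligible, key=lambda o: o.typical_recovery)


for rto in (timedelta(minutes=2), timedelta(hours=1), timedelta(days=5)):
    print(rto, "->", suggest_option(rto).name)
```

In practice the choice also depends on the RPO, data volumes, and budget, so a lookup like this is at most a first filter.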
Specific operational goals of implementing recovery point objectives / recovery time objectives
There are specific goals associated with implementing recovery point objectives / recovery time objectives:
- RPO and RTO targets are a key step towards defining the organization’s risk appetite and requirements for a business continuity plan.
- RPO and RTO allow IT to convince senior management of the need for recovery spending by using quantifiable targets set by the business as the basis for such spending.
- RPO and RTO help you in selecting the appropriate recovery technologies. They define the range of what is possible in terms of recovery technologies and processes.
- In doing this, RPO and RTO help achieve the right balance between meeting business objectives, yet not overspending on technology to meet goals that were never actually set.
- RPO and RTO figures facilitate the testing and audit of a business contingency and disaster recovery plan by providing a benchmark for measuring results, and by constituting the basis for a data recovery service level agreement (SLA).
As such, RPO and RTO are key metrics for measuring recovery time and data characteristics. Other supplementary metrics include:
- Recovery Time Granularity (RTG) determines the time spacing between recovery points; whereas RPO is the last recovery point prior to a failure, RTG defines recovery point selection options prior to that recovery point.
- Recovery Object Granularity (ROG) expresses the level of objects that a recovery solution is capable of recovering. For instance, object granularity may be a storage volume, a file system, a database table, a database row / column / field, a transaction, a mailbox, an email message, etc.
- Recovery Event Granularity (REG) measures the ability of a recovery solution to track events and to recover an application or data to a specific event.
- Recovery Consistency Characteristics (RCC) measures the usability of recovered data by the associated application.
- Recovery Location Scope (RLS) defines where the protected data must be stored when recovery takes place (i.e., locally, remotely, on which media / storage tier).
- Recovery Service Scalability (RSS) measures the number of applications or data sets the recovery solution handles, and the maximum size of the data it can store.
- The Maintenance Point Objective (MPO) describes the maximum allowable window for performing scheduled system maintenance.
Risks of implementing recovery point objectives / recovery time objectives
There are few risks involved in setting RPO and RTO. Doing so may uncover weaknesses in your current recovery architecture, as it highlights solutions that exceed the RPO / RTO and are not cost-effective, or solutions that do not meet the RPO / RTO (and so do not satisfy business requirements for recoverability).
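One way to picture this kind of gap analysis is to compare what an existing solution can achieve against the agreed targets. In the minimal sketch below, the class and function names are hypothetical, and the `headroom` threshold used to flag a solution as possibly over-provisioned is an arbitrary assumption, not an industry rule.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Figures:
    rpo: timedelta  # data loss, as a span of time
    rto: timedelta  # time to restore service


def classify(target: Figures, achieved: Figures, headroom: float = 0.5) -> str:
    """Classify a solution against its RPO / RTO targets.

    Assumption: a solution that achieves both figures in less than
    `headroom` times the target may be more than the business asked for.
    """
    if achieved.rpo > target.rpo or achieved.rto > target.rto:
        return "shortfall"
    if (achieved.rpo <= target.rpo * headroom
            and achieved.rto <= target.rto * headroom):
        return "possible overspend"
    return "on target"


target = Figures(rpo=timedelta(hours=24), rto=timedelta(hours=8))

print(classify(target, Figures(timedelta(hours=1), timedelta(hours=1))))
print(classify(target, Figures(timedelta(hours=24), timedelta(hours=12))))
print(classify(target, Figures(timedelta(hours=20), timedelta(hours=6))))
```

A "possible overspend" result is only a prompt for review; the extra capability may be justified by other requirements.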
The recovery point objective / recovery time objective solution
The business driver for initiating a recovery point objective / recovery time objective-based solution is to balance the cost of a recovery solution with its performance, meeting user requirements for recovery in the process. These user requirements are increasingly driven from outside the organization, as other stakeholders (such as customers, business partners, regulators, and auditors) express needs for business continuity. Implementing an RPO / RTO-based solution proceeds through needs analysis, system design, and deployment / monitoring.
Expectations (Out-of-scope)
Defining recovery point objectives and recovery time objectives is just one piece of the business continuity planning puzzle. It is part of the larger process of completing a business impact analysis (BIA), which is an assessment of business processes, information, and systems that forms the basis for recovery planning. Performing a business impact analysis is out of the scope of this entry. Note also that some automated BIA tools will calculate RTO and RPO figures based upon the assessment results; these calculated figures should be confirmed with key business users, since they may not meet their expectations (if they do not, then the BIA itself may not be accurate).
Analyze phase
The analysis phase begins with the business impact analysis, during which the following activities take place:
- critical business processes, systems, personnel, records, and data are identified
- any contractual commitments made for business continuity are collected
- recovery priorities are set
- internal and external dependencies are documented
- financial impacts for outages are quantified (including operational, legal, regulatory, and customer impacts)
- recovery point objective and recovery time objective (and associated metrics) are defined
This process may be iterative, requiring one or two passes by key business users and senior management. It also involves providing some training to individuals who may not be familiar with business continuity planning or the meaning of RPO and RTO. You should also expect some users to define very stringent RPO / RTO numbers when the business need is not entirely clear; accept that these figures could be reconsidered once the cost of the resultant recovery solutions becomes clear to them during the design stage.
It is also necessary to understand the detail behind the RTO. For example, if an RTO is set at 3 days, does this business process/function need to be restored at 100% of production capacity? Or could it be recovered in stages (certain capabilities within 1 day, others by 3 days, and others by 30 days)? RTO is not always a single number for an organization; there may be different RTOs by business, department, function, geography, application system, or system platform (mainframe, web, server, etc.).
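A staged RTO of this kind can be written down as a small table of deadlines and required capacity. The sketch below reuses the 1-day, 3-day, and 30-day stages from the example above; the capacity percentages attached to each stage and the names `STAGES` and `required_capacity` are assumptions made purely for illustration.

```python
from datetime import timedelta

# (deadline after the event, cumulative % of production capacity required)
# The percentages are illustrative assumptions.
STAGES = [
    (timedelta(days=1), 30),
    (timedelta(days=3), 70),
    (timedelta(days=30), 100),
]


def required_capacity(elapsed: timedelta, stages=STAGES) -> int:
    """Capacity (in %) that must be restored once `elapsed` has passed."""
    required = 0
    for deadline, percent in stages:
        if elapsed >= deadline:
            required = percent
    return required


for days in (0.5, 1, 2, 3, 45):
    print(f"day {days}: {required_capacity(timedelta(days=days))}% required")
```

Keeping a separate table per business, department, or application captures the point that an organization rarely has a single RTO.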
Shareware BIA templates are available at no charge, though more comprehensive models are sold commercially (often for under $10,000). The associated staffing costs include charges for the time of any dedicated internal resources (typically one business continuity analyst or manager), or for external consultants where the resources are not available in-house. Large business recovery vendors (like SunGard, IBM, and EDS) may bundle BIA and other planning services with their recovery capabilities.
Acceptance test considerations
The Analyze Phase is complete when the business impact analysis is complete and when preliminary RPO and RTO figures have been established, justified, and accepted by the business and by management. This phase can take anywhere from a week in a smaller, less complex organization to several weeks or more in a large and multifaceted operation.
Key analysis milestones
Milestones in the Analysis Phase typically include the following:
- The business impact analysis (BIA) is complete and signed off.
- RPO and RTO numbers have been generated by the BIA process and / or by business users, and have been signed off by management.
- Details and timelines behind RTOs are clear.
Design phase
The business impact analysis, RPO, RTO, and supporting metrics form the business requirements for recovery and continuity. The first step of the design stage is to make clear the technology requirements. In this stage, the relevant systems, applications, databases, networks, storage tiers, and supporting technologies that form the backbone of the businesses inventoried in the BIA are described. Next, you are ready to map business requirements to technical requirements. At this stage, application-specific requirements for recovery are specified (some applications may require transaction processing during backup and recovery, others not). It is also at this phase that database considerations (rollback requirements, referential integrity considerations, etc.) are described.
With these requirements, the various recovery solutions can be designed, engineered, and selected. Differing technologies (such as mass storage array-based synchronous or asynchronous replication, operating system replication, continuous backup technologies, "snap" or BCV copies, tape generation and cloning, and multi-phase dual-site commit databases) can be selected or combined to meet these requirements and the RPO / RTO targets. Together with suitable processes and staffing, these technologies can be arranged into various scenarios to find the most cost-effective solution (for example, strategies could include disk-disk-archived tape, disk-disk-archived disk, and various combinations of scheduled onsite and offsite backups).
Finally, the business continuity analyst or manager must perform a financial analysis of the various possible solutions. This should include a review of:
- Initial capital expense
- Increases in operating costs associated with additional staff or consultants
- Media costs
- Maintenance costs
- Subscription costs to offsite storage locations or managed recovery locations
- Accounting / tax treatment of these costs
Depending upon the nature of the business and technical requirements, and upon the process of investigating commercial recovery offerings, this phase can take from a few weeks to a few months.
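The cost items listed above can be combined into a simple multi-year comparison. The sketch below is a deliberately crude model: it ignores accounting / tax treatment, discounting, and inflation, and every figure and solution name in it is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SolutionCost:
    name: str
    initial_capital: float
    annual_staffing: float      # additional staff or consultants
    annual_media: float
    annual_maintenance: float
    annual_subscription: float  # offsite storage or managed recovery site

    def total_cost(self, years: int) -> float:
        """Capital expense plus recurring costs over the given period."""
        annual = (self.annual_staffing + self.annual_media
                  + self.annual_maintenance + self.annual_subscription)
        return self.initial_capital + annual * years


def cheapest(solutions, years: int) -> SolutionCost:
    """Return the solution with the lowest total cost over `years`."""
    return min(solutions, key=lambda s: s.total_cost(years))


# Hypothetical candidates; all amounts are made up.
tape = SolutionCost("disk-disk-archived tape", 20_000, 15_000, 5_000, 2_000, 6_000)
replication = SolutionCost("asynchronous replication", 90_000, 5_000, 0, 9_000, 12_000)

for s in (tape, replication):
    print(s.name, s.total_cost(years=3))
print("cheapest over 3 years:", cheapest([tape, replication], years=3).name)
```

Cost alone does not settle the choice: only solutions that meet the agreed RPO and RTO should enter the comparison in the first place.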
Acceptance test considerations
The Design Phase is complete when technology requirements are complete and mapped to business requirements, when a solution has been designed, when alternatives have been identified and evaluated, and when a cost-benefit analysis has been completed.
Key design milestones
Milestones in the Design Phase include the following:
- Systems, applications, databases, networks, storage tiers, and supporting technologies have been identified and mapped to the BIA results.
- Technical design has been described and mapped to business requirements.
- It is clear and traceable how each defined RPO and RTO will be met by this technical design.
- Alternative solutions have been identified and evaluated.
- These alternatives have been subjected to a cost-benefit analysis.
- Funding and approval for implementation of the recovery solution(s) has been secured.
Deploy phase
The Deploy Phase involves writing and implementing a project plan, supported by a well-resourced project team, to deploy the chosen solution(s) such that they meet the RPO and RTO figures already agreed. While RPO and RTO are best tested during an actual event, incident, or disaster situation, every effort must be made during the implementation to test a representative recovery process to determine whether the targets will be met. Conducting a disaster recovery test is outside the scope of this note but, as with other testing, a test of RPO / RTO compliance is best done in a technical environment that closely replicates the production operational environment. Where this is exceedingly difficult, some organizations will agree to conduct a brief test using the production environment (during a slow part of the business processing cycle).
The deployment and testing of a recovery solution typically takes at least several weeks.
Acceptance test considerations
The Deploy Phase has been successfully implemented when the users are satisfied that their recovery requirements (including their RPO and RTO targets) have been met, and when the IT department is happy that their technical requirements have been met in an effective manner.
Key deployment milestones
Milestones in the Deploy Phase include the following:
- Project plan in place and agreed.
- Project team is formed.
- A deployment is arranged.
- Acceptance criteria are evaluated.
- A test is conducted.
- RPO, RTO, and supporting metrics are collected.
- Lessons learned are documented, possibly leading to adjustments in the implementation plan, technical configuration, or even technology choice.
- Users and management conduct a final review of their requirements given test performance.
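Collecting RPO and RTO results from a recovery test amounts to comparing a few timestamps with the agreed targets. The sketch below assumes the test records three moments: the simulated failure, the recovery point that was actually restored, and the moment service was restored. The function name, dictionary keys, and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_test(failure: datetime, recovery_point: datetime,
                  restored: datetime, rpo: timedelta, rto: timedelta) -> dict:
    """Compare the data loss and downtime measured in a test with the targets."""
    data_loss = failure - recovery_point  # work lost since the restored copy
    downtime = restored - failure         # time taken to restore service
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Example: a "close of business" recovery point restored after an afternoon failure.
result = evaluate_test(
    failure=datetime(2023, 3, 2, 14, 0),
    recovery_point=datetime(2023, 3, 1, 18, 0),
    restored=datetime(2023, 3, 2, 17, 30),
    rpo=timedelta(hours=24),
    rto=timedelta(hours=4),
)
print(result)
```

In this example the measured data loss is 20 hours and the downtime is 3.5 hours, so both targets are met. Results like these feed the final review by users and management.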
Initiative summary
Deploying RPO and RTO targets, including supporting metrics, can be done in any of the following circumstances:
- An immature or nonexistent business recovery strategy and architecture is in place
- Basic backup and recovery processes and technologies are in place, but these have been deployed over time without regard to formal business requirements (i.e., these were “IT-driven”)
- A formal process and capability for BCP is in place, and RPO / RTO definition is one of the final stages of process maturity.
In the first situation, RPO and RTO target deployment is part and parcel of deploying a full BCP capability. Depending upon the size and complexity of the organization, that deployment can take from a number of months to over a year. The second case is perhaps the most common for today’s organizations. Typically much of the work involved in the first situation must still take place; the difference is that existing recovery technologies, processes, and staff are already in place and can be leveraged for the ultimate solution, saving time and money. In the final situation, organizations may be able to deploy RPO and RTO targets within an existing recovery architecture via tuning, reconfiguring, process enhancement, and adding capacity. In such instances, deployment could take place in weeks. If new technologies must be deployed to meet these targets, deployment could take a few to several months.