Originating Author: Robert Levine
Enterprises of all sizes are re-evaluating backup and recovery strategies. This is driven by the following factors:
- Organizations have always been aware of the impact of local threats and vulnerabilities (fire, flood, bomb threat, and hardware / software / network / media failure) on their continuity of business.
- More recent terrorist attacks and threats, power outages, extreme weather conditions, and other large-scale disasters have shown that regional impacts can occur too – and can undermine many assumptions of traditional backup and recovery planning.
- In today’s competitive and global e-business driven world, expectations for system performance and uptime are higher than ever before – amid greater demand to shrink backup and recovery windows. Many enterprises that include trading operations, health services delivery, and key manufacturing processes have always required well-managed backup and recovery processes.
- Financial services, healthcare, and other regulators increasingly see backup and recovery processes as key to a well-controlled operational environment.
- Mobile computing technologies, home offices, and other drivers that distribute data processing further than ever before create new challenges in ensuring that distributed information is properly backed up.
- Business and technology requirements are changing more quickly than ever before; these changes mean that backup and recovery capabilities need to be well planned and flexible.
Specific operational goals of implementing an effective backup and recovery capability
There are specific goals associated with implementing recovery point objectives / recovery time objectives:
- Business-defined recovery point objective (RPO), maintenance point objective (MPO), and recovery time objective (RTO) metrics, and most other business requirements for continuity of operations, can only be met through appropriate backup and recovery solutions.
- Other key metrics that should show improvement with robust backup and recovery include: time to backup (measured for each storage tier, technology, server type, application, and database type), time to restore (measured similarly), number of support calls relating to backup or recovery issues, percentage of time that systems problems lead to data loss, percentage of time that systems problems lead to outages, average downtime per month, and percentage of IT time spent conducting backup and restore operations.
- A strong backup and recovery capability is a competitive advantage (indeed, a necessity) to organizations operating in multiple time zones, delivering e-business services, or generally operating a high availability business model.
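As an illustration of how several of the metrics above might be rolled up from raw operational records, here is a minimal Python sketch. The `Incident` record layout, the field names, and the `summarize` function are assumptions made for this example and do not describe any particular monitoring tool. It assumes `months` is greater than zero and that every tier has at least one recorded backup run.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """One recorded systems problem (hypothetical record layout)."""
    caused_data_loss: bool
    caused_outage: bool
    downtime_minutes: float


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`; 0.0 when there is no data."""
    return 100.0 * part / whole if whole else 0.0


def summarize(incidents, backup_minutes, months):
    """Roll raw records up into a few of the metrics named in the text."""
    total = len(incidents)
    return {
        "pct_incidents_with_data_loss": pct(sum(i.caused_data_loss for i in incidents), total),
        "pct_incidents_with_outage": pct(sum(i.caused_outage for i in incidents), total),
        "avg_downtime_minutes_per_month": sum(i.downtime_minutes for i in incidents) / months,
        # Time to backup is reported separately for each storage tier.
        "avg_backup_minutes_by_tier": {tier: mean(runs) for tier, runs in backup_minutes.items()},
    }
```

Tracking the same figures before and after a change to the backup scheme gives a simple way to check whether the metrics are actually improving.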
Risks of implementing an effective backup and recovery capability
It is not technically difficult to install and operate a backup and recovery scheme, but it is all too easy to implement the “wrong” scheme, i.e., one that either provides insufficient protection or provides more protection than needed at an excessive price. In some organizations (particularly those that have grown through mergers and acquisitions rather than organically) there are multiple, overlapping backup and recovery capabilities that can be rationalized to provide the same level of protection at less cost. Clear business requirements and targets for recovery can ensure that this does not happen.
The effective backup and recovery solution
The business driver for initiating a backup and recovery solution is to strike the optimal balance between continuity of business, data protection, and cost. An effective backup and recovery solution is implemented in three phases: needs analysis, system design, and deployment / monitoring.
Expectations (Out-of-scope)
Defining recovery point objectives and recovery time objectives and, more broadly, completing a business impact analysis are the starting point for understanding backup and recovery needs. These steps are out of scope for this note, which assumes that they have already been completed.
Analyze phase
The analysis phase includes investigating the various backup and recovery scenarios in light of defined recovery requirements and metrics:
- An inventory of existing backup and recovery schemes, and supporting schedules, should be collected.
- A gap analysis of these existing capabilities versus defined RPOs, RTOs, and other metrics and requirements should be conducted.
- The gap analysis should also consider the overall business and technology strategy, to ensure that the recovery strategy includes not just current systems and data but projected systems and data.
- Gaps identify areas to be addressed by changing configuration parameters, adjusting backup scheduling, altering the storage tiers involved in the backup strategy, adding capacity, increasing bandwidth, or choosing entirely new recovery technologies.
- Any duplicate or excessive backup coverage should be identified for rationalization.
- File server, server-based, enterprise-wide / full backup and recovery, server-less backup and recovery, and continuous backup schemes should all be considered.
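The gap analysis described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `System` fields and the `find_gaps` function are hypothetical names, and worst-case data loss is approximated as one full backup interval, which will not hold for every scheme.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    rpo_hours: float               # business-defined target: maximum tolerable data loss
    rto_hours: float               # business-defined target: maximum tolerable downtime
    backup_interval_hours: float   # how often the current scheme captures data
    measured_restore_hours: float  # restore time observed or estimated today


def find_gaps(systems):
    """Return (name, kind, shortfall_hours) for every unmet target, worst first."""
    gaps = []
    for s in systems:
        # Assumption: worst-case data loss is roughly one backup interval.
        if s.backup_interval_hours > s.rpo_hours:
            gaps.append((s.name, "RPO", s.backup_interval_hours - s.rpo_hours))
        if s.measured_restore_hours > s.rto_hours:
            gaps.append((s.name, "RTO", s.measured_restore_hours - s.rto_hours))
    # Largest shortfall first, as a crude way of prioritizing gaps.
    return sorted(gaps, key=lambda g: g[2], reverse=True)
```

For example, a system with a 1-hour RPO that is backed up nightly would be reported with a 23-hour RPO shortfall, while a system whose interval and restore time are within target would not appear at all.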
Backing up data on a local or centralized file server (using either disk or tape drives for storage) is a very basic backup scheme. A system of full and incremental backups can help manage the size of the backup window. This solution is easily implemented and relatively inexpensive but is not scalable and is also heavily dependent upon local area network bandwidth and availability. It is also very technology dependent, and may not work with legacy technologies. Another approach is to assign distributed backup clients throughout the enterprise, which move backup data over the network to a backup server or cluster of backup servers. This solution is more scalable than the first. Yet another approach is to perform full local desktop and server backups on local backup servers, which are then “cloned” to safe remote backup servers over the network. This works particularly well in highly distributed computing environments. The major disadvantage of all of these methods is that data and systems are not available during backup processing, and that it is difficult or impossible to assure a continuous recovery point with such schemes.
So-called “server-less backup” schemes use high-capacity (e.g., Fibre Channel) storage area networks (SANs) to move backup data between storage tiers and to backup servers. (Despite the name, a server is still necessary to initiate and manage data moving over the SAN.) This can help reduce (but not eliminate) the backup window and improve the recovery point, but typically at much higher cost owing to the expense of the fast network. Also, many devices (local servers, workstations, remote devices) are likely not connected to the SAN and will not benefit from this higher level of service. Finally, “continuous backup / archiving” solutions like Sun StorEdge SAM-FS purport to offer continuous backup, and easy restore from any recovery point.
Besides backup, you should also consider the performance of the recovery process. All backup processes based upon full backups, or full and incremental backups, introduce latency into recovery. First, the most recent full backup is restored. If incremental backups have been used, they must then be restored in order to bring the system to its state as of the last incremental backup. If recovery will occur from a remote backup clone, that can add further time. So while the backup process can take up to a few hours, the recovery process can take up to a few days in many cases.
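The restore sequence just described (latest full backup first, then each later incremental in order) can be sketched as follows. This is a minimal illustration, not a description of any backup product: the `Backup` record, the function names, and the per-backup restore estimates are assumptions, and each incremental is assumed to capture changes since the previous backup of any kind.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Backup:
    taken_at: datetime
    kind: str             # "full" or "incremental"
    restore_hours: float  # estimated time to restore this backup set


def restore_chain(backups, target):
    """Backups to restore, in order, to reach the latest state at or before `target`."""
    usable = sorted((b for b in backups if b.taken_at <= target), key=lambda b: b.taken_at)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup exists at or before the target time")
    base = fulls[-1]
    # Every incremental taken after the base full must be applied, oldest first.
    later = [b for b in usable if b.kind == "incremental" and b.taken_at > base.taken_at]
    return [base] + later


def estimated_restore_hours(chain, remote_transfer_hours=0.0):
    """Sum of per-backup restore times plus any time needed to fetch a remote clone."""
    return remote_transfer_hours + sum(b.restore_hours for b in chain)
```

The estimate grows with every incremental in the chain, which is one reason a short backup window does not by itself imply a short recovery time.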
Acceptance test considerations
The Analyze Phase is complete when the recovery gap analysis has been completed, when existing backup and recovery solutions have been assessed, and when appropriate alternatives (given technology, cost, and need) have been identified. The subsequent Design Phase will specify improvements and select an alternative. The Analyze Phase should normally take a few weeks at most.
Key analysis milestones
Milestones in the Analyze Phase typically include the following:
- An inventory of existing backup and recovery schemes and schedules.
- A gap analysis of these existing capabilities versus defined RPOs, RTOs, and other metrics and requirements.
- Gaps have been classified and prioritized.
- Duplicate / excessive backup coverage has been identified.
- A preliminary scan of alternative solutions has been conducted and mapped to gaps.
Design phase
As a first stage towards designing solutions for the gaps identified in the gap analysis, it is important to look at the capacity of the existing backup and recovery solution. This includes:
- Number and type of disk storage devices
- Number and type of tape drives
- Amount and type of available tapes
- Number of I/O channels per server
- Capacity issues with the backup software application(s)
- Number and capacity of backup servers
- Backup scheduling software in use
- Bandwidth dedicated to backup / recovery network operations
- Backup options available in database, software, and operating systems in use (journaling, replication, etc.)
- Cold, warm, and hot backup sites in use
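Several of the capacity items above (number of drives, per-drive throughput, dedicated bandwidth) feed directly into a rough estimate of the backup window. The sketch below is a simplification under stated assumptions: the function name and parameters are hypothetical, 1 GB is taken as 1000 MB, and compression, media changes, and contention on the source systems are ignored.

```python
def backup_window_hours(data_gb, drives, drive_mb_per_s, network_mb_per_s):
    """Estimated hours to move `data_gb` through the slowest stage of the backup path."""
    if data_gb < 0 or drives <= 0 or drive_mb_per_s <= 0 or network_mb_per_s <= 0:
        raise ValueError("data_gb must be >= 0 and all capacities must be positive")
    # Drives are assumed to write in parallel; the network link is shared by all of them.
    effective_mb_per_s = min(drives * drive_mb_per_s, network_mb_per_s)
    return (data_gb * 1000) / effective_mb_per_s / 3600
```

With illustrative figures of 3600 GB, four drives at 50 MB/s each, and a 100 MB/s network, the network is the bottleneck and the estimate is 10 hours; adding drives would not help, whereas raising the bandwidth to 400 MB/s would halve the window to 5 hours.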
Many organizations first identify “low hanging fruit”, or easy adjustments to existing processes and capabilities to address gaps.
Next, the tools and capabilities identified as possibilities in the Analyze Phase should be evaluated against the following criteria, which reflect the goals in choosing backup and recovery tools:
- The tools must support, not dictate, the business continuity strategy and associated metrics.
- Where possible, the tools should integrate with existing tools for scheduling, securing, and “vaulting” off site backup copies and with existing and planned technologies in use in the enterprise.
- The tools should provide scheduling, automation, and backup management capabilities.
- The tools must be capable of growing and scaling.
- The tools must be cost effective.
- The tools must pass user acceptance testing before any deployment.
Depending upon the nature of the business and technical requirements, and upon the process of investigating commercial recovery offerings, this phase can take from a few weeks to a few months.
Acceptance test considerations
The Design Phase is complete when “low hanging fruit” have been identified and additional tools and capabilities have been slated for selection.
Key design milestones
Milestones in the Design Phase include the following:
- Current backup and recovery tools and capabilities are inventoried.
- Each of the gaps in the gap analysis is evaluated in light of what can be done with current capabilities to solve the business problem cost-effectively.
- Technical design documents are written, and mapped to business continuity objectives and gaps.
- New tools and solutions are evaluated against best design practices.
- Alternatives have been subjected to a cost-benefit analysis.
- Funding and approval for implementation of the recovery solution(s) has been secured.
Deploy phase
The Deploy Phase involves writing and implementing a project plan and securing project and operational resources to deploy and manage the chosen solution(s) such that they address any gaps. Vendor due diligence should occur. Representative recovery processes should be tested for each acquired tool or solution. These should only be rolled into production after test results have been accepted by IT and the business.
The deployment and testing of a recovery solution typically takes at least several weeks.
Acceptance test considerations
The Deploy Phase is complete when the users are satisfied that their recovery requirements (including their RPO and RTO targets) have been met, and when the IT department is satisfied that its technical requirements have been met in an effective manner. Testing should include the following:
- Integration with existing enterprise server operational and security tools, operational technologies, and with existing or newly acquired recovery sites should be assessed against documented recovery objectives.
- The tool should be tested to realistic job schedules.
- Speed analysis compared to the current backup / recovery solution should occur.
- The ability to restart failed events, or to restart processing after a system error, should be tested.
- The system should be easy to use, and should include sufficient documentation for installation and operations.
- Reporting, notification, and auditing of backup and recovery events must take place.
- Interoperability with existing AIS hardware infrastructure for disk and tape storage should be verified.
- The system should be manageable from a central or secure remote console.
- The vendor support response must be evaluated.
Key deployment milestones
Milestones in the Deploy Phase include the following:
- Project plan in place and agreed.
- Project team is formed.
- A deployment is arranged.
- Acceptance criteria are evaluated.
- A test plan is written.
- Operators are trained in the new backup tools and capabilities.
- Testing is conducted.
- RPO, RTO, and supporting metrics are collected and evaluated against requirements.
- Management must sign off on test results prior to final deployment.
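Evaluating collected RPO and RTO figures against requirements, as the milestones above call for, can be as simple as comparing timestamps from a recovery test with the business targets. The following Python sketch assumes a hypothetical test record with three timestamps (`last_good_backup`, `failure`, `service_restored`); these key names are invented for the example.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Data-loss window: time between the last usable backup and the failure."""
    return failure - last_good_backup


def achieved_rto(failure: datetime, service_restored: datetime) -> timedelta:
    """Downtime: elapsed time from the failure until service is usable again."""
    return service_restored - failure


def passes(test_run: dict, rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True when one recovery test meets both targets."""
    rpo = achieved_rpo(test_run["last_good_backup"], test_run["failure"])
    rto = achieved_rto(test_run["failure"], test_run["service_restored"])
    return rpo <= rpo_target and rto <= rto_target
```

A test with a backup at 02:00, a simulated failure at 09:30, and service restored at 12:30 yields an achieved RPO of 7.5 hours and an achieved RTO of 3 hours, so it would pass targets of 8 and 4 hours but fail a 4-hour RPO target.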
Initiative summary
Sizing the rollout of effective backup and recovery solutions is organization-specific. Each enterprise is at its own level of process and technology maturity, and each organization will have different recovery requirements. Mature organizations that just need to “tweak” their current solutions could do so in weeks at very low cost. Firms in the middle of the road will typically keep some or many of their existing solutions but will implement faster backup networks, purchase additional backup and recovery software, or contract with recovery sites. They may need to hire contractors for the deployment, and additional operational staff to manage the live system. The latter often earn from $50,000 to $80,000 depending upon the location and complexity of the enterprise. Rolling out these solutions can take a few to several months, and can cost tens of thousands to hundreds of thousands of dollars (more for large and complex enterprises), and more still for organizations starting from scratch.