Originating Author: Robert Levine
The suite of backup and recovery options needed, and available, for today’s organization is radically different from that of a decade or two ago. Simple tape drives and native OS backup and recovery tools are no longer up to the task. Data center processing and storage sizes have grown exponentially, as has the number of centrally supported and end-user computing applications in use. There is a greater awareness of the risks, threats, and vulnerabilities affecting enterprises at the same time that end-user expectations of recovery have increased. In this note, we will explore some of the ways organizations have moved beyond the simple backup methods of the past.
Effective backup and recovery techniques
Backing up and later restoring potentially huge amounts of data in a reasonable time is particularly challenging when systems processing or information access cannot stop to wait for this process. In some cases, backing up or restoring entire systems, or the enterprise, after an event or disaster can take days. Globalization of the enterprise’s operations and sales, and doing business across multiple time zones, means that the enterprise can no longer afford to take systems or data off-line for significant periods during a backup or restore operation, nor can it easily absorb maintenance windows or technology failures. Finally, the costs of data recovery from a damaged environment can be prohibitive; these costs can be avoided with a strong backup strategy. It is easy to see the need for better backup / recovery solutions, technologies, and processes to cope with these challenges.
Specific operational goals of implementing effective backup and recovery techniques
There are specific goals associated with implementing an effective backup and recovery capability:
- It should be possible to restore multiple systems (applications, databases, operating system platforms) to the same point in time.
- Remote backup systems, off-site storage, and vaults / archives should be manageable by the backup solution.
- Business-defined recovery point objective (RPO), maintenance point objective (MPO), and recovery time objective (RTO) metrics, and most other business requirements for continuity of operations, should be met through the backup and recovery solution.
- Other key metrics that should show improvement with robust backup and recovery include: time to back up (measured for each storage tier, technology, server type, application, and database type), time to restore (measured similarly), number of support calls relating to backup or recovery issues, percentage of systems problems that lead to data loss, percentage of systems problems that lead to outages, average downtime per month, and percentage of IT time spent conducting backup and restore operations.
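Several of the metrics listed above are simple ratios over an incident log, so they can be tracked with very little tooling. The following is a minimal sketch in Python, assuming a hypothetical `Incident` record with a downtime figure and a data-loss flag; the record layout, field names, and the `summarize` function are illustrative assumptions, not part of any particular backup product.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One systems problem in the reporting period (hypothetical layout)."""
    downtime_minutes: float  # 0 if the problem caused no outage
    data_lost: bool          # True if any data could not be recovered


def summarize(incidents, months):
    """Return three of the metrics named in the text for a reporting period."""
    if months <= 0:
        raise ValueError("months must be positive")
    total = len(incidents)
    if total == 0:
        return {"pct_data_loss": 0.0, "pct_outage": 0.0,
                "avg_downtime_per_month": 0.0}
    with_loss = sum(1 for i in incidents if i.data_lost)
    with_outage = sum(1 for i in incidents if i.downtime_minutes > 0)
    total_downtime = sum(i.downtime_minutes for i in incidents)
    return {
        # Percentage of systems problems that led to data loss
        "pct_data_loss": 100.0 * with_loss / total,
        # Percentage of systems problems that led to an outage
        "pct_outage": 100.0 * with_outage / total,
        # Average downtime (minutes) per month over the period
        "avg_downtime_per_month": total_downtime / months,
    }


if __name__ == "__main__":
    log = [Incident(30, False), Incident(0, True),
           Incident(90, True), Incident(0, False)]
    print(summarize(log, months=3))
```

Collecting the same figures before and after a new solution is deployed gives a like-for-like basis for judging whether the metrics have in fact improved.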
Risks of implementing effective backup and recovery techniques
It is not technically difficult to install and operate a backup and recovery scheme, but it is too easy to implement the “wrong” scheme – i.e., one that either provides insufficient protection or that provides more protection than needed at an excessive price tag. In some organizations (particularly those that have grown through mergers and acquisitions rather than organically) there are multiple, overlapping backup and recovery capabilities that can be rationalized to provide the same level of protection at less cost. Clear business requirements and targets for recovery can ensure that the wrong scheme is not chosen.
Effective backup and recovery techniques
The business driver for initiating a backup and recovery solution is to strike the optimal balance between continuity of business, data protection, and cost. Implementing an effective backup and recovery solution proceeds through three phases: needs analysis, system design, and deployment / monitoring.
Expectations (Out-of-scope)
Defining recovery point objectives, recovery time objectives, and more broadly completing a business impact analysis and risk assessment is the starting point towards understanding backup and recovery needs but is out of scope of this note – which assumes that these steps have already been completed.
Analyze phase
The Analyze Phase includes investigating the various backup and recovery options in light of recovery requirements. The two basic types of backup are manual (where the backup and recovery processes are initiated manually) and automatic (where these are scheduled or event-driven). Outside small office / home office (SOHO) environments, virtually no organization will find that a manual backup system meets its needs, so we will restrict our discussion to automatic systems. These include:
- Backing up data on a local or centralized file server (using either disk or tape drives for storage) is the most basic backup scheme. A system of full and incremental backups can help manage the size of the backup window. This solution is easily implemented and relatively inexpensive but is not scalable and is also heavily dependent upon local area network bandwidth and availability.
- Another approach is to assign distributed backup clients throughout the enterprise, which move backup data over the network to a backup server or cluster of backup servers. This solution is more scalable than the first. Yet another approach is to perform full local desktop and server backups on local backup servers, which are then “cloned” to safe remote backup servers over the network. This works particularly well in highly distributed computing environments. The major disadvantage of all of these methods is that data and systems are not available during backup processing, and that it is difficult or impossible to assure a continuous recovery point with such schemes.
- A storage area network (SAN) can enable the sharing of storage devices and data across a fast network connection.
- So-called “server-less” or “LAN-free” backup schemes use high-capacity (e.g., Fibre Channel) storage area networks to move backup data between storage tiers and to backup servers, addressing the problem of storage device capacity exceeding network bandwidth. This can help reduce (but not eliminate) the backup window and improve the achievable recovery point (also enabling centralized scheduling, reporting, and management), but typically at much higher costs owing to the expense of the fast network. Also, many devices (local servers, workstations, remote devices) are likely not connected to the SAN and will not benefit from this higher level of service.
- A disk-to-disk-to-tape (D2D2T) backup scheme sends backups first to disk, and then to tape for off-site storage. This allows the streaming and backup of several systems simultaneously. This is a popular scheme among enterprises with high-availability requirements.
- Agent-based backup schemes have been developed to enable easier backup of both traditional databases and non-structured content. Block-level backup agents increase the speed of agents by enabling backup at the block level of the storage device.
- Audit trail compaction is a useful database recovery tool; audit trails are preprocessed off-line to maintain only the most recent before and after images of changed records. This allows for fast database recovery from media failure or database error.
- In snapshot and replication-based schemes, frequent data “snapshots” are taken and then replicated off-site to disk, which makes them available for immediate use without a restore. Tapes are then created at that off-site location for long-term data storage.
- So-called object-based backup and delta-block incremental backup schemes save space and bandwidth (and time) by backing up only the files (or, for delta-block schemes, the blocks) that they have not seen before.
- Continuous backup systems (usually referred to as continuous data protection, or CDP) offer block-level replication to the backup system as soon as a block has changed. Any changes to the source device that trigger changes to the backup device are logged such that the system allows restoration to any point in that recovery log.
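The continuous data protection idea in the last item, logging every changed block so that any point in the log can be restored, can be illustrated with a small in-memory model. This is a minimal sketch under stated assumptions: the `ChangeJournal` class, its method names, and the use of plain integers as timestamps are hypothetical, and a real CDP product would add persistent storage, write-order guarantees, and retention limits.

```python
class ChangeJournal:
    """Toy model of a CDP recovery log (hypothetical, in-memory only)."""

    def __init__(self, base_blocks):
        # block number -> contents at the time of the initial full copy
        self.base = dict(base_blocks)
        # (timestamp, block number, new contents), appended in time order
        self.log = []

    def record(self, timestamp, block_no, data):
        """Log one changed block as soon as the source device writes it."""
        if self.log and timestamp < self.log[-1][0]:
            raise ValueError("writes must be logged in time order")
        self.log.append((timestamp, block_no, data))

    def restore(self, point_in_time):
        """Rebuild the device image as it stood at point_in_time."""
        image = dict(self.base)
        for timestamp, block_no, data in self.log:
            if timestamp > point_in_time:
                break  # later writes are ignored for this restore
            image[block_no] = data
        return image


if __name__ == "__main__":
    journal = ChangeJournal({0: b"A", 1: b"B"})
    journal.record(10, 0, b"A2")
    journal.record(20, 1, b"B2")
    journal.record(30, 0, b"A3")
    # Image as of time 20: both blocks carry their second version.
    print(journal.restore(20))
```

Because every write is kept, the recovery point is limited only by how far back the log is retained, which is what distinguishes this approach from periodic snapshots in the sketch.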
Acceptance Test Considerations
The Analyze Phase is complete when the backup and recovery options have been considered in light of RPO, RTO, MTO, and detailed business recovery requirements. There may be more than one option that meets these requirements, so during the Design Phase we will consider how to choose. The Analyze Phase can take a few to several weeks. Where there is insufficient knowledge in the organization to choose and evaluate these options, an expert consultant is often brought in to help. This can cost from several thousand dollars for an independent consultant to well into five figures for a large and established consultancy.
Key analysis milestones
Milestones in the Analyze Phase typically include the following:
- An inventory of existing backup and recovery schemes and schedules has been collected
- Existing constraints (network bandwidth, servers / mainframes in use, physical processing locations, backup facilities, etc.) are understood
- RPOs, RTOs, MTOs, and other metrics and requirements are agreed and documented
- RFIs / RFPs have been sent to appropriate vendors of backup and recovery solutions; these are based upon agreed metrics and requirements
- Backup options and available commercial applications of these have been surveyed in light of these metrics and requirements
- A short list of options and technologies is chosen
Design phase
At this stage, the tools and capabilities identified as alternatives in the Analyze Phase should be evaluated against the following criteria:
- Where possible, the tools should integrate with existing tools for scheduling, securing, and “vaulting” off site backup copies and with existing and planned technologies in use in the enterprise.
- The tools should provide scheduling, automation, and backup management capabilities and must accommodate remote management, management of off-site storage, and vaulting / archiving processes.
- The tools must be capable of growing and scaling.
- The tools must be cost-effective, offer attractive licensing terms, and come with agreeable payment terms.
- Acceptance criteria for deployment should be set.
- The tools must pass user acceptance testing before any deployment.
- The tool vendor(s) must pass the organization’s due diligence process.
- The tool must be subject to adequate vendor support, with agreeable service levels.
- The tools should be supported by ongoing research and development, regular upgrades, etc.
- The vendor contract must be acceptable and pass internal legal review.
- Ideally the vendor should offer training and implementation assistance.
Depending upon the nature of the business and technical requirements, and upon the process of investigating commercial recovery offerings, this phase can take from a few weeks to a few months.
Acceptance test considerations
The Design Phase is complete when the tools and capabilities identified in the Analyze Phase have been evaluated against metrics, requirements, integration needs, implementation requirements, costs versus budget, and vendor / contract criteria.
Key design milestones
Milestones in the Design Phase include the following:
- Current backup and recovery tools and capabilities are inventoried
- Technical design documents are written, and mapped to business continuity objectives and gaps
- New tools and solutions are evaluated against these design specifications
- Alternatives have been subjected to a cost-benefit analysis
- Vendors and contract terms are subject to review and due diligence
- Funding and approval for implementation of the recovery solution(s) has been secured
- Acceptance criteria are set for deployment of the tools
- Tools are subject to user acceptance testing prior to deployment
Deploy phase
The Deploy Phase involves writing and implementing a project plan and securing project and operational resources to deploy and manage the chosen solution(s) such that they meet defined metrics and requirements. It is key at this stage that IT staff involved in backup and recovery operations receive thorough training and documentation in the new tools. Representative recovery processes should be tested for each acquired tool or solution (this is also a test of the adequacy of training and documentation). These should only be rolled into production after test results have been accepted by IT and the business.
The deployment and testing of a backup and recovery solution typically takes at least several weeks.
Acceptance Test Considerations
The Deploy Phase has been successfully implemented when the users are satisfied that their recovery requirements (including their RPO and RTO targets) have been met, and when the IT department is happy that their technical requirements have been met in an effective manner. Testing should include the following:
- Complete execution of various disaster / incident scenarios according to the agreed business continuity plan.
- Management of recovery from a central console, remote console, or from a recovery site.
- Use of realistic job schedules.
- Restarting of failed events, and recovery from a systems error.
- Reporting, notification, and auditing of backup and recovery events.
- Level of vendor support during recovery operations.
Key deployment milestones
Milestones in the Deploy Phase include the following:
- Project plan in place and agreed
- Project team is formed
- A deployment is arranged
- Acceptance criteria are evaluated
- A test plan is written
- Operators are trained in the new backup tools and capabilities
- Testing is conducted
- RPO, RTO, and supporting metrics are collected and evaluated against requirements
- Management must sign off on test results prior to final deployment
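Evaluating collected RPO and RTO figures against requirements comes down to comparing two measured intervals from each recovery test with the agreed targets. The sketch below shows one way to do that in Python; the function name `evaluate_drill` and its parameters are illustrative assumptions, and it takes the achieved recovery point to be the gap between the last good backup and the failure, and the achieved recovery time to be the gap between the failure and the restoration of service.

```python
from datetime import datetime, timedelta


def evaluate_drill(last_good_backup, failure_time, service_restored, rpo, rto):
    """Compare one recovery test against agreed RPO and RTO targets."""
    # Work done after the last good backup is what would have been lost.
    data_loss_window = failure_time - last_good_backup
    # Time from the failure until the service was usable again.
    downtime = service_restored - failure_time
    return {
        "achieved_rpo": data_loss_window,
        "achieved_rto": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


if __name__ == "__main__":
    result = evaluate_drill(
        last_good_backup=datetime(2024, 1, 1, 2, 0),
        failure_time=datetime(2024, 1, 1, 5, 30),
        service_restored=datetime(2024, 1, 1, 9, 0),
        rpo=timedelta(hours=4),
        rto=timedelta(hours=2),
    )
    for name, value in result.items():
        print(name, value)
```

Recording the measured intervals alongside the pass/fail flags gives management the evidence needed for sign-off, and shows how much margin each tested scenario leaves.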
Initiative summary
Traditional backup systems are being enhanced or replaced with D2D2T systems, server-less backups, specialized agents, and sophisticated technologies like snapshot / replication-based backup, object-based backup, delta-block backups, and continuous backup systems. These can help meet ever more aggressive RPO and RTO targets and more stringent requirements for recovery. The newer, more sophisticated backup and recovery technologies will almost certainly require the organization to procure consultancy assistance to select the right tool, and implementation assistance to deploy it. While most vendors offer the latter, it is in the enterprise’s best interest to select an independent / non-vendor consultant to help in tool selection. Rolling out these sophisticated solutions may involve more than tool deployment: network upgrades, security architecture reconfiguration, and rationalization of hardware and software may be required to facilitate this implementation. Where this is the case, full rollout can run into six-figure costs and a deployment schedule of several months. Less sophisticated implementations can be completed far more quickly and cheaply.