Originating Author: Prakash Babu
Summary
A reference environment for fast data backup, replication, and recovery was created, tested, and measured. 100 terabytes were backed up, deduplicated, and replicated to a remote site in less than 14 hours using commodity hardware and hardware-agnostic software working in concert. Keys to this achievement were:
- LAN-free backups
- Backup software that understands virtual tape and provides useful APIs
- VTL software with:
  - Unique de-duplication capabilities
  - Advanced pre-fetch algorithms
  - Exploitation of the backup software’s advanced VTL APIs
  - Special treatment of the backup software’s media catalog
- WAN-optimized replication
- Commodity disk subsystems and servers
- Fibre Channel SAN (4Gb)
- A high degree of parallelism
- Dark fiber between the sites
Re-Engineering Traditional Backup and Recovery
Traditional strategies for backup involve installing backup agents on each application server and backup data mover software on dedicated backup servers. Data to be backed up is sent by those agents over a LAN to the dedicated backup servers where it is consolidated and sent to real or virtual tape drives. Here, the LAN is a major bottleneck.
Another issue is that, until recently, backup software did not understand virtual tape. While VTLs did a great job of fooling the backup software into thinking it was talking to real tape, the result was two media catalogs with no way to synchronize them. Additionally, space in the VTL could be reclaimed only a whole virtual tape volume at a time, and the backup software often misreported the amount of free and used space in the VTL.
Finally, the advent of advanced data reduction techniques, with their dramatic reduction ratios at high speed, enables a whole new way to think about replicating data to remote sites for disaster recovery (DR) purposes.
LAN-free Backup
With SANs becoming ubiquitous, many backup software vendors now offer the option to send the backup data stream over the SAN instead of the LAN. This typically pits 1Gb Ethernet against 4Gb Fibre Channel. Moreover, data transfer is done on a block basis instead of a file basis resulting in improved efficiencies. This reference environment used a 4Gb Fibre Channel SAN.
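To put the two transports in perspective, here is a back-of-the-envelope sketch using nominal line rates only (real-world throughput is lower after protocol overhead, and backups rarely saturate a single link):

```python
# Rough comparison of nominal LAN vs. SAN transport for a 100 TB backup.
# Line rates only; protocol overhead and device contention are ignored.

TB = 10**12  # bytes (decimal units assumed)

def transfer_hours(data_bytes: float, link_gbits: float) -> float:
    """Hours to move data_bytes over a link rated at link_gbits gigabits/sec."""
    bytes_per_sec = link_gbits * 10**9 / 8
    return data_bytes / bytes_per_sec / 3600

data = 100 * TB
print(f"1 GbE LAN:   {transfer_hours(data, 1):6.1f} h")  # ~222 h per link
print(f"4 Gb FC SAN: {transfer_hours(data, 4):6.1f} h")  # ~56 h per link
```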
Backup Software
For this reference environment, the backup software used was Symantec’s NetBackup 6.5 with the OpenStorage (OST) option. OST is a technology for NetBackup environments designed to simplify backup operations and reduce disaster recovery (DR) costs; a conceptual sketch of the style of integration it enables follows the list below. With OST and certified VTL software, NetBackup:
- Allows for intelligent use of backup capacity;
- Can age backup images independent of media;
- Will free up backup images when no longer needed;
- Is aware of de-duplication capacity;
- Can perform partial restore without the need to "read-past" unneeded data in large backup sets;
- Is aware of replication of backup images to the remote site.
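The Python sketch below illustrates the style of interface these capabilities imply: backup images become first-class objects whose lifecycle, expiry, and replication are visible to the backup application rather than hidden inside the device. All names here are hypothetical; the real OpenStorage API is a licensed C plug-in interface with different entry points.

```python
# Hypothetical model of an OST-style storage server (illustrative only;
# not the actual OpenStorage API).

from dataclasses import dataclass

@dataclass
class BackupImage:
    name: str
    size_bytes: int
    expired: bool = False

class StorageServer:
    """Toy model: the backup application, not the device, controls image lifecycle."""

    def __init__(self) -> None:
        self.images: dict[str, BackupImage] = {}

    def write_image(self, name: str, size_bytes: int) -> BackupImage:
        # Images are catalogued objects, not opaque regions of a virtual tape.
        img = BackupImage(name, size_bytes)
        self.images[name] = img
        return img

    def expire_image(self, name: str) -> None:
        # The backup app ages images independent of media and marks them
        # reclaimable immediately -- no waiting on a whole virtual tape.
        self.images[name].expired = True

    def reclaim(self) -> int:
        # Space from expired images is freed at image granularity.
        freed = sum(i.size_bytes for i in self.images.values() if i.expired)
        self.images = {n: i for n, i in self.images.items() if not i.expired}
        return freed

    def replicate(self, name: str, remote: "StorageServer") -> None:
        # Replication is visible to the backup app: the remote copy is
        # catalogued, not a hidden device-level mirror.
        img = self.images[name]
        remote.images[name] = BackupImage(img.name, img.size_bytes)
```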
NetBackup media server software was installed on two Linux servers and backup agent software on the application servers.
VTL and Data Reduction Software
For this reference environment, the VTL software was FalconStor Virtual Tape Library (VTL), including its OST Option, a plug-in component for NetBackup that uses the OST API to integrate key VTL functions with NetBackup operations.
VTL
The FalconStor VTL software operates in a clustered environment supporting up to eight nodes; only two were needed for this environment. They were hosted on high-end commodity servers (two dual-core Intel Xeon 7200-series processors: E7220, 2.93 GHz, 8 MB cache) and included Hifn data compression cards. Each server had eight Fibre Channel ports and ran Red Hat Enterprise Linux 5.1. Four dual-headed commodity RAID disk arrays were used, with a total usable capacity of 100 terabytes on SATA drives.
Single Instance Repository (SIR)
SIR is an optional data reduction post-process that runs asynchronously yet concurrently with ongoing VTL operations. It first removes duplicate blocks (de-duplication) and then compresses the remaining blocks. SIR also operates in a clustered environment, supporting up to 8+1 nodes. Advanced pre-fetch algorithms ensure data restores achieve high throughput.
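A minimal sketch of the general block-deduplicate-then-compress technique follows. This is illustrative only; FalconStor’s actual block sizes, fingerprinting, and algorithms are not described in this document.

```python
# Minimal sketch of block-level de-duplication followed by compression.
# Assumes fixed-size blocks and SHA-256 fingerprints for illustration.

import hashlib
import zlib

BLOCK_SIZE = 64 * 1024  # assumed block size, for illustration only

store: dict[bytes, bytes] = {}  # fingerprint -> compressed unique block

def ingest(data: bytes) -> list[bytes]:
    """De-duplicate, then compress only blocks not already stored.
    Returns the fingerprint sequence needed to reconstruct the data."""
    recipe = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        fp = hashlib.sha256(block).digest()
        if fp not in store:                   # de-dup: each unique block stored once
            store[fp] = zlib.compress(block)  # then compress what remains
        recipe.append(fp)
    return recipe

def restore(recipe: list[bytes]) -> bytes:
    # A production system would pre-fetch upcoming blocks here to keep
    # restore throughput high, as the article notes.
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)
```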
In this test, only four nodes were used. They were hosted on high-end commodity servers (four quad-core Intel Xeon 7400-series processors). Each server had eight Fibre Channel ports and ran Red Hat Enterprise Linux 5.1. Ten dual-headed commodity RAID disk arrays were used, with a total usable capacity of 550 terabytes on SATA drives.
The Link
A single 4Gb Fibre Channel link connected the two sites. The link was not dedicated to replication; it remained available to be shared with other applications.
The Test
The test set consisted of the following:
- Backing up 100 TB of data from mixed workloads (web servers, e-mail, database, etc.),
- Reducing the data (deduplication and compression),
- Replicating the reduced data to the DR site as it became available,
- Restoring the data at the DR site.
Results
Table 1 summarizes the results. Overall, it took 14 hours to back up and reduce the data and 11.6 hours to restore it. Of particular note, of the original 100TB, only 2.5TB needed to be replicated to the remote site, which kept link utilization under 15%.
Table 1. Backup, reduction, replication, and recovery performance

| Key Metric | Performance Rate per Node | Test Bed Performance Rate | Elapsed Time (Hours) |
| --- | --- | --- | --- |
| Ingest rate | 1.4 GB/sec | 2.8 GB/sec | 10 |
| Deduplication throughput | 0.5 GB/sec | 2.0 GB/sec | 14* |
| Data reduction ratio | 40:1 (20:1 deduplication, 2:1 compression) | | |
| Replication of reduced data (4Gb FC link) | 0.4 GB/sec*** | 0.4 GB/sec*** | 0** |
| Total time to protect data (ingest, reduce, replicate to DR site) | | | 14 |
| Time to recover at DR site | 1.2 GB/sec | 2.4 GB/sec | 11.6 |

* Overlaps with ingest time.
** Replication occurs simultaneously with data reduction, adding zero net time.
*** Link is approximately 15% utilized for this workload.
Source: FalconStor Software
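As a rough sanity check, the headline numbers in Table 1 are internally consistent. The sketch below assumes decimal units and uses the table’s aggregate rates:

```python
# Consistency check of Table 1's headline figures (decimal units assumed).

TB, GB = 10**12, 10**9

backup_bytes = 100 * TB
ingest_rate = 2.8 * GB      # aggregate ingest, bytes/sec
reduction_ratio = 40        # 20:1 dedup x 2:1 compression

print(backup_bytes / ingest_rate / 3600)    # ~9.9 h, matching the 10 h ingest row
print(backup_bytes / reduction_ratio / TB)  # 2.5 TB replicated to the DR site

# 2.5 TB spread over the 14 h protection window on a 4 Gb (0.5 GB/sec) link:
link_capacity = 4 * GB / 8
print(2.5 * TB / (14 * 3600) / link_capacity)  # ~0.10, consistent with <15% utilization
```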
Conclusions
The data reduction rate exceeded contemporary industry benchmarks with only a modest number of nodes in the cluster. Achieving maximum throughput in a backup/restore system is always a question of balance: had more nodes been used, the disk arrays or server bandwidth would likely have become the bottleneck. Even so, speeds faster than this respectable result are possible.