Deduplication systems are used primarily for backup. Vendors make many confusing performance claims for deduplication systems. Curtis Preston has put together a very good table of vendor claims on his Backup Central web site, including the claimed performance of the major target deduplication systems. The starting points of a performance analysis are:
- The performance of a single node,
- The number of nodes that will work together as a clustered global deduplication system (all nodes must be able to receive data from any backup stream and access all the metadata),
- How well the nodes work together (Wikibon extension).
The performance of clustered systems is rarely linear. Locking of shared resources, maintaining cache coherency, and contention for shared resources all add overhead that reduces performance. Moreover, the reduction in performance is not linear but grows with the number of nodes. The best way to think about it is to imagine that there are links between every pair of nodes, and every link creates overhead. With two systems there is one link, with three there are three links, with four there are six links, and so on; the formula for the number of links between n nodes is n(n-1)/2.
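The link formula can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `link_count` is illustrative and does not come from any vendor tool.

```python
def link_count(n: int) -> int:
    """Number of pairwise links between n fully connected nodes: n(n-1)/2."""
    if n < 0:
        raise ValueError("node count must be non-negative")
    # n * (n - 1) is always even, so integer division is exact.
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Print the link count for small clusters to show how quickly it grows.
    for nodes in range(2, 9):
        print(f"{nodes} nodes -> {link_count(nodes)} links")
```

Doubling the node count roughly quadruples the number of links, which is why the overhead grows faster than the capacity added.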
One measure of overhead is the inter-node overhead for a 2-node system, which is given in Table 1.
The overhead is a function of the architecture and the type of workload being run on the system. A good clustered system with a multiprocessor-friendly workload can achieve 1.9 times single-node performance with a 2-node system, which is a 5% overhead. Table 3 in the footnotes applies the overhead to multi-node systems, and Chart 1 shows the same data graphically. What is interesting is that there is a maximum number of nodes that can be deployed before performance actually declines (this is reflected in the real world, where large n-way clusters are partitioned into multiple smaller clusters to minimize the inter-node overheads).
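One way to see why a peak exists is a rough model, sketched below. It assumes that every node loses a fixed fraction of its throughput for each link in the cluster, and that this fraction equals the 2-node inter-node overhead. This is an assumption made for illustration and is not necessarily the calculation behind Table 3; the names `relative_performance` and `best_node_count` are hypothetical.

```python
def link_count(n: int) -> int:
    """Number of pairwise links between n nodes: n(n-1)/2."""
    return n * (n - 1) // 2


def relative_performance(n: int, two_node_overhead: float = 0.05) -> float:
    """Cluster throughput as a multiple of one node's throughput.

    ASSUMED model (not from the article's tables): per-node efficiency
    drops by `two_node_overhead` for every link in the cluster.
    """
    efficiency = 1.0 - two_node_overhead * link_count(n)
    # Efficiency cannot fall below zero in this simple model.
    return max(0.0, n * efficiency)


def best_node_count(two_node_overhead: float = 0.05, max_nodes: int = 64) -> int:
    """Node count with the highest modeled throughput, searched up to max_nodes."""
    return max(
        range(1, max_nodes + 1),
        key=lambda n: relative_performance(n, two_node_overhead),
    )


if __name__ == "__main__":
    for nodes in range(1, 8):
        print(f"{nodes} nodes -> {relative_performance(nodes):.2f}x single node")
    print("modeled peak at", best_node_count(), "nodes")
```

With a 5% overhead this model reproduces the 1.9 figure for two nodes, rises to about 2.8 at four nodes, and falls again from five nodes onward. A different overhead model would move the peak, so the output should be read as an illustration of the shape of the curve rather than as a sizing guide.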
Table 2 combines the single-node performance data from Backup Central with an estimate of the performance of the maximum multi-node system. Evaluators should be aware that:
- The single-node performance will be the maximum the vendor can achieve under ideal conditions and should be significantly discounted to assess real-world performance.
- The Inter-node Overhead for a 2-node system is unlikely to be better than 5% under real-world conditions.
The key advantage of multi-node systems is that the deduplication takes place over a larger body of data, so greater deduplication ratios can be achieved. The key advantages of a high deduplication rate are that the backup window will be under less pressure and the disaster recovery point objective (RPO) will be shorter.
Action Item: The maximum deduplication rate is important but is just one factor in selecting a deduplication system. The solution has to fit in with current backup procedures, has to be configured to meet RPO and RTO objectives, and must be configured to meet backup windows. Discount vendor performance claims for a single node, and discount even more heavily claims of linear multi-node performance.
Footnotes: