Co-author: David Floyer
There has been significant discussion in the industry about storage optimization and making better use of storage capacity. A number of storage vendors have successfully marketed data de-duplication for offline/backup applications, reducing the volume of backup data by a factor of 5-15:1, according to Wikibon user input.
Data de-duplication as applied to backup use cases differs from compression in that compression transforms the data itself, using algorithms to encode the same information in fewer bits. With de-duplication, the data is not changed; rather, copies 2-N are deleted and replaced with pointers to a 'master' instance of the data. Single-instancing can be thought of as synonymous with de-duplication.
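The pointer-to-master idea can be sketched as a small content-addressed store (the class and hash choice here are illustrative, not any vendor's implementation): the first copy of a piece of data becomes the master, and subsequent identical writes store only a pointer.

```python
import hashlib

class SingleInstanceStore:
    """Toy single-instance store: duplicate writes become pointers to a master copy."""

    def __init__(self):
        self.masters = {}   # content hash -> master data
        self.pointers = {}  # object name -> content hash

    def write(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.masters:
            self.masters[digest] = data   # first copy becomes the master instance
        self.pointers[name] = digest      # copies 2-N are just pointers

    def read(self, name):
        return self.masters[self.pointers[name]]

store = SingleInstanceStore()
store.write("fileA", b"quarterly report")
store.write("fileB", b"quarterly report")  # duplicate: only a pointer is added
assert store.read("fileB") == b"quarterly report"
assert len(store.masters) == 1  # one master instance despite two writes
```

Note that the data itself is unchanged on disk; only the redundant copies disappear, which is the key distinction from compression.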
Traditional data de-duplication technologies, however, are generally unsuitable for online or primary storage applications because the overhead of the algorithms required to de-duplicate data unacceptably elongates response times. As an example, popular data de-duplication solutions such as those from Data Domain, ProtecTIER (Diligent/IBM), FalconStor, and EMC/Avamar are not used to reduce online storage capacity.
There are three primary approaches to optimizing online storage, reducing capacity requirements and improving overall storage efficiencies. Generally, Wikibon refers to these in the broad category of on-line or primary data compression, although the industry will often use terms like de-duplication (e.g. NetApp A-SIS) and single instancing. These data reduction technologies are illustrated by the following types of solutions:
- NetApp A-SIS and EMC Celerra which employ either “data de-duplication light” or single-instance technology embedded into the storage array;
- Host-managed offline data reduction solutions such as Ocarina Networks;
- In-line data compression appliances available from IBM Real-time Compression.
Unlike some data reduction solutions for backup, these three approaches use lossless data compression algorithms, meaning mathematically, bits can always be reconstructed.
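The lossless property is easy to demonstrate with a standard algorithm such as DEFLATE (used here via Python's zlib purely as an illustration): decompression always reconstructs the original bits exactly.

```python
import zlib

# Redundant data compresses well; the round-trip is bit-exact either way.
original = b"AAAA BBBB AAAA BBBB " * 100
compressed = zlib.compress(original)

assert len(compressed) < len(original)          # fewer bits stored on disk
assert zlib.decompress(compressed) == original  # mathematically reconstructed
```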
Each of these approaches has certain benefits and drawbacks. The obvious benefit is reduced storage cost. However, each solution places another technology layer in the network, increasing complexity and risk.
Array-based data reduction
Array-based data reduction technologies such as A-SIS reduce primary storage capacity as data is written. The de-duplication feature of WAFL (NetApp's Write Anywhere File Layout) identifies duplicate 4K blocks: at write time, a weak 32-bit digital signature of each 4K block is created and placed into a signature file in the metadata, and candidate matches are then compared bit-by-bit to ensure that there is no hash collision. The work of identifying the duplicates is similar to NetApp's snapshot technology and is done in the background if controller resources are sufficient. By default it runs once every 24 hours and whenever the percentage of changed data reaches 20%.
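The weak-signature-plus-verification scheme described above can be sketched as follows. The 32-bit signature and 4K fixed block size follow the description (CRC32 stands in for the actual signature function, and the data structures are hypothetical):

```python
import zlib

BLOCK_SIZE = 4096

def weak_signature(block):
    """Weak 32-bit digital signature of a 4K block (illustrative: CRC32)."""
    return zlib.crc32(block)

def dedupe(blocks):
    """Collapse identical 4K blocks, verifying candidates bit-by-bit."""
    signature_file = {}  # signature -> list of (stored index, block) candidates
    block_map = []       # logical block index -> index of the stored master block
    stored = []
    for block in blocks:
        sig = weak_signature(block)
        master = None
        for j, candidate in signature_file.get(sig, []):
            if candidate == block:  # byte-for-byte compare rules out collisions
                master = j
                break
        if master is None:
            stored.append(block)
            master = len(stored) - 1
            signature_file.setdefault(sig, []).append((master, block))
        block_map.append(master)
    return stored, block_map

blocks = [b"x" * BLOCK_SIZE, b"y" * BLOCK_SIZE, b"x" * BLOCK_SIZE]
stored, block_map = dedupe(blocks)
assert len(stored) == 2        # the duplicate 4K block is stored only once
assert block_map == [0, 1, 0]  # logical blocks point at master instances
```

Because the signature is weak (only 32 bits), the full comparison step is essential; without it, two different blocks that happen to share a signature would silently corrupt data.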
There are five main disadvantages of an A-SIS solution:
- With A-SIS, de-duplication can only occur within a single flex-volume (not a traditional volume), meaning candidate blocks must be co-resident within the same volume to be eligible for comparison. The de-duplication is based on fixed 4K blocks, rather than the variable-length blocks of (say) IBM/Diligent. This limits the de-duplication potential;
- There is a complicated set of constraints when A-SIS is used together with different snapshot types, depending on the level of software. Snapshots made before de-duplication override de-duplication candidacy in order to maintain data integrity, which limits the space savings potential of de-dupe. Specifically, NetApp's de-dupe is not cumulative with space-efficient snapshots. See (technical description);
- The performance overheads of deduplication as described above mean that A-SIS should not be applied to a highly utilized controller (where the most benefit is likely to be achieved);
- There is an overhead for the metadata (up to 6%);
- To exploit this feature, users are locked-in to NetApp storage.
IT Managers should note that A-SIS is included as a no-charge standard offering within NetApp's Nearline component of ONTAP, the company's storage OS.
Host-managed offline data compression solutions
Ocarina is an example of a host-managed data reduction offering, or what it calls 'split-path.' It consists of an offline process that reads files through an appliance, compresses them, and writes them back to disk. When a file is requested, another appliance re-hydrates the data and delivers it to the application. The advantage of this approach is much higher levels of compression, because the process runs offline and can use more robust algorithms. A reasonable planning assumption is a reduction ratio of 3-6:1, and sometimes higher, for initial ingestion and read-only Web environments. However, because of the need to re-hydrate data when it is written, classical production environments may see lower ratios.
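A hedged sketch of the split-path flow (the function names are hypothetical, and gzip stands in for Ocarina's proprietary algorithms): one pass compresses files in place offline, and a read path re-hydrates them on demand.

```python
import gzip
import os
import tempfile
from pathlib import Path

def ingest(path):
    """Offline pass: read a file, compress it, write it back to disk compressed."""
    data = Path(path).read_bytes()
    Path(path).write_bytes(gzip.compress(data, compresslevel=9))

def rehydrate(path):
    """Read path: decompress and deliver the original bytes to the application."""
    return gzip.decompress(Path(path).read_bytes())

# Demonstrate the round trip on a throwaway temp file.
fd, name = tempfile.mkstemp()
os.close(fd)
original = b"archived log line\n" * 1000
Path(name).write_bytes(original)

ingest(name)
assert len(Path(name).read_bytes()) < len(original)  # smaller on disk
assert rehydrate(name) == original                   # application sees original
os.remove(name)
```

The sketch also makes the main drawback visible: every read of a compressed file pays the re-hydration cost, which is why this approach suits infrequently accessed data.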
In the case of Ocarina, the company has developed proprietary algorithms that can improve reduction ratios on many existing file types (e.g. jpeg, pdf, mpeg, etc), which is unique in the industry.
The main drawbacks of host-managed data reduction solutions are:
- The expense of the solution is not insignificant due to appliance and server costs needed to perform compression. In infrequently accessed, read-only or write-light environments, these costs will be justified.
- To achieve these benefits, all files must be ingested, which is a slow process. Picking the right use cases will minimize this issue.
- After a file is read and modified, it is written back to disk uncompressed. To achieve savings, files must be re-compressed, limiting use cases to infrequently accessed files.
- Ocarina currently supports only files, unlike NetApp A-SIS which supports both file and block-based storage. However Ocarina's implementation offers several advantages over A-SIS (remember A-SIS is free).
- The solution is not highly scalable because the processes related to backup, re-hydration, and data movement are complicated.
On balance, solutions such as Ocarina are highly suitable and cost-effective for infrequently accessed data and read-intensive applications. High update environments should be avoided.
In-line data compression
IBM Real-time Compression offers in-line data compression whereby a device sits between servers and the storage network (see Shopzilla's architecture). Wikibon members indicate a compression ratio of 1.5-2:1 is a reasonable rule-of-thumb.
The main advantages of the IBM Real-time Compression approach are very low latency (i.e. microseconds) and improved performance. Storage performance improves because compression occurs before data hits the storage network. As a result, all data in the storage network is compressed, meaning less data is sent through the SAN, cache, internal array, and disk devices, reducing resource requirements and shrinking backup windows by 40% or more, according to Wikibon estimates.
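The downstream capacity effect of an in-line ratio is simple arithmetic: with an R:1 ratio, only 1/R of the bytes traverse the SAN, cache, and disks, so the savings fraction is 1 - 1/R. Applying the 1.5-2:1 rule-of-thumb cited above:

```python
def capacity_savings(ratio):
    """Fraction of bytes eliminated downstream by an R:1 in-line compression ratio."""
    return 1 - 1 / ratio

assert capacity_savings(2.0) == 0.5               # 2:1 halves data through the SAN
assert round(capacity_savings(1.5), 3) == 0.333   # 1.5:1 saves roughly a third
```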
There are two main drawbacks of the IBM Real-time Compression approach, including:
- Costs of appliances and network re-design to exploit the compression devices. The Wikibon community estimates a clear ROI will be realized in shops with greater than 30TB;
- Complexity of recovery: specifically, users need to plan for re-hydration of data when performing recovery of backed-up files (i.e. a Storwize engine or software must be present to recover from a data loss).
On balance, the advantage of an Ocarina or IBM Real-time Compression approach is that it can be applied to any file-based storage (i.e. heterogeneous devices). NetApp and other array-based solutions lock customers into a particular storage vendor but have certain advantages as well; for example, they are simpler to implement because they are already integrated.
An Ocarina approach is best applied in read-intensive environments, where its post-process/batch ingestion methodology achieves better reduction ratios. IBM Real-time Compression will achieve the highest levels of compression and ROI in general-purpose enterprise data centers of 30TB or greater.
Action Item: On-line data reduction is rapidly coming to mainstream storage devices in your neighborhood. Storage executives should familiarize themselves with the various technologies in this space and demand that storage vendors apply capacity optimization techniques to control storage costs.
Footnotes: RELATED RESEARCH
- Podcast: Shopzilla and UBS discuss in-line compression results (5 mins).
- Podcast: Wikibon practitioners discuss in-line compression advantages and pitfalls, in depth. (30 mins).