It would be nice to apply data deduplication to any application, but unfortunately it's not that simple. Performance considerations are important with this emerging technology. Writes are fairly clean: any overhead from applying deduplication algorithms is offset by the need to store less data, which reduces physical data movement. Reads require more effort, because the system must read the data (hopefully from cache), find the hash, interpret the hash and then read the native data. As such, data deduplication should be aimed primarily at write-intensive applications.
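To make the write and read paths concrete, here is a minimal sketch of a hash-based deduplicating store. It is an illustration under stated assumptions, not any vendor's implementation: the `DedupStore` class, the fixed 4 KB chunk size, the use of SHA-256 and the in-memory dictionaries are all hypothetical choices made for the example.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed fixed-size chunking
        self.chunks = {}   # hash -> native chunk bytes, stored once
        self.recipes = {}  # object name -> ordered list of chunk hashes

    def write(self, name, data):
        """Store data and return the number of bytes physically added."""
        recipe = []
        new_bytes = 0
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Only previously unseen chunks consume new capacity.
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            recipe.append(digest)
        self.recipes[name] = recipe
        return new_bytes

    def read(self, name):
        """Rebuild the native data: fetch the recipe, then each chunk by hash."""
        recipe = self.recipes[name]
        return b"".join(self.chunks[digest] for digest in recipe)
```

In this sketch a write of already-seen data adds nothing to `chunks`, which is where the reduced data movement comes from. A read cannot return data directly; it has to resolve every hash in the recipe first, which is the extra indirection described above.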
In addition, data deduplication will perform best on serial data streams with no locking involved, so database applications, for example, are unlikely to be good candidates. Users should also target applications where many copies are made over time and similar copies of data are moved. Candidate applications include backup, archiving and even certain data-mining operations. Within some larger applications, certain files (e.g., history files) will be strong candidates.
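One rough way to judge whether a workload fits this profile is to estimate how much of its data repeats across copies. The sketch below is a hypothetical helper, `dedup_ratio`, that assumes fixed-size chunks and SHA-256 hashes; real products may chunk and hash differently, so treat the result as indicative only.

```python
import hashlib


def dedup_ratio(streams, chunk_size=4096):
    """Return logical bytes divided by unique bytes across all streams.

    `streams` is an iterable of bytes objects, for example successive
    backup copies of the same data set. A ratio near 1.0 suggests a poor
    candidate; higher values mean more repeated data.
    """
    seen = set()   # hashes of chunks already counted as stored
    logical = 0    # bytes the application wrote
    stored = 0     # bytes that would actually need to be kept
    for data in streams:
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            logical += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    # An empty input has nothing to deduplicate; report a neutral ratio.
    return logical / stored if stored else 1.0
```

Running this over a few successive backups of the same data set would be expected to give a high ratio, while a stream of mostly unique data would stay close to 1.0.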
Once candidate applications are identified, users should still assume, despite vendor suggestions to the contrary, that at least some of the deduped data will be backed up to tape. Today, the lack of deduplication technology in tape and the absence of standards mean that data must be un-deduped before it can be backed up to tape. This adds overhead, complexity and elapsed time to backup and restore operations. A common deduplication standard across disk and tape would allow faster backups and allow deduped data to be restored to any disk that supports the standard, further simplifying operations and reducing reliance on proprietary vendor implementations.
Action Item: Users should push hard for both tape and disk vendors to develop data deduplication standards that facilitate simpler backup and restore operations. Organizations should be very careful about committing broadly to a single-vendor solution without understanding its full implications for disk-tape-disk cloning, backup and restore processes.