Contributing Practitioner: Burzin Engineer
On June 11, 2009, the Wikibon community gathered for a Peer Incite Research Meeting with Burzin Engineer, Shopzilla’s Vice President of Infrastructure Services. Shopzilla is a comparison shopping engine born during the dot-com boom. The company has survived and thrived, serving approximately 19 million unique visitors per month. Shopzilla’s secret sauce is its ability to refresh more than 20 million products six times per day, while its nearest competitors can refresh their product offerings only once every 3-4 days.
With no shopping malls or real estate footprint, Shopzilla is an IT-driven business with an intense focus on international 24x7 operations. Its two main data centers each house thousands of servers and several hundred TB of storage. Shopzilla HQ supports these locations with more than 400 test, development, and Q/A machines.
At points during the mid-2000s, Shopzilla’s growth required adding 100 IT full-time equivalents (FTEs) year-on-year. This brought huge challenges for Shopzilla’s infrastructure, especially because its developers were reluctant to get rid of data. Facing 200% data growth, increasing costs, and ever-growing power consumption, Engineer decided to implement a new architecture using data compression on primary storage.
Shopzilla’s solution placed redundant IBM Real-time Compression STN-6000 appliances in front of its NAS filer heads. All test and dev I/Os are sent through the appliances, which use highly efficient, lossless data compression algorithms to optimize primary file-based storage. Shopzilla has experienced a 50% reduction in storage capacity where the compression solution is applied. David Burman, a practitioner, storage architect, and consultant at a large financial institution, supported Engineer’s claims, indicating that his organization has seen capacity reductions of 60% or more on primary storage, and in some cases 90%.
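For readers who want to gauge what such an appliance might save on their own data, the minimal Python sketch below estimates a lossless compression ratio by streaming a sample file through zlib (DEFLATE). This is not IBM Real-time Compression's algorithm, and the file path is simply an illustrative argument, so treat the result only as a rough indicator of compressibility.

```python
# Minimal sketch: estimate how compressible a sample of primary data is.
# Uses Python's zlib (DEFLATE) as a stand-in; real appliance ratios will differ.
import sys
import zlib


def sample_ratio(path: str, chunk_size: int = 1 << 20) -> float:
    """Return compressed-size / original-size for the file at `path`."""
    original = compressed = 0
    compressor = zlib.compressobj(level=6)
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            original += len(chunk)
            compressed += len(compressor.compress(chunk))
    compressed += len(compressor.flush())
    return compressed / original if original else 1.0


if __name__ == "__main__":
    ratio = sample_ratio(sys.argv[1])
    print(f"compressed to {ratio:.0%} of original ({1 - ratio:.0%} capacity saved)")
```

Running it against a representative directory sample (one file at a time) gives a quick sense of whether a pool is a good candidate before any hardware is purchased.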
Compress or de-dupe?
Engineer and other Wikibon practitioners and analysts on the Peer Incite call arrived at the following additional conclusions pertaining to compression as applied to primary storage:
- Compression is the logical choice for primary storage optimization due to its effectiveness, efficiency and performance. Specifically, according to both practitioners on the call, the IBM Real-time Compression product actually improves system performance, because it reduces the amount of data moved and pushes more user data through the system.
- Primary data compression is complementary and additive to data de-duplication solutions from suppliers such as Data Domain, FalconStor, Diligent, etc. Shopzilla compresses data and then sends the compressed data to a Data Domain system as part of its backup process. Shopzilla’s tests show that this approach yields higher overall (‘blended’) reduction ratios than a standalone data de-duplication solution with no compression on primary storage (see the sketch after this list).
- Unlike array-based solutions, compression appliances can be deployed across heterogeneous storage.
- Array functionality (e.g., snapshots, clones, remote replication) is unaffected, and performance is enhanced because less data moves through the system.
- Primary compression solutions today are file-based. While IBM Real-time Compression has promised SAN-based solutions in the future, the complexity of that effort, combined with a large NAS market opportunity, has kept IBM Real-time Compression focused on file-based storage and slowed the expected delivery of SAN solutions.
- Re-hydration of data (i.e., uncompressing or un-de-duplicating) is often problematic in optimized environments; however, IBM Real-time Compression’s algorithms reduce compression overheads to microseconds (versus orders of magnitude greater for in-line data de-duplication), minimizing re-hydration penalties.
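To make the ‘blended’ reduction ratio mentioned above concrete, here is a small illustrative calculation. The percentages are assumptions chosen for the example, not Shopzilla’s measured figures, and in practice a de-duplication device will typically see a different ratio on already-compressed data than on raw data.

```python
# Illustrative arithmetic: capacity savings from primary compression and
# downstream de-duplication compound multiplicatively on the remaining data.
def blended_reduction(compression_saving: float, dedupe_saving: float) -> float:
    """Overall fraction of capacity saved across both stages."""
    remaining = (1 - compression_saving) * (1 - dedupe_saving)
    return 1 - remaining


# Hypothetical example: 50% savings from primary compression, then 80%
# de-duplication on the (already compressed) backup stream.
print(f"{blended_reduction(0.50, 0.80):.0%} blended capacity saving")  # 90%
```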
Implementation Considerations
The ROI of primary data compression is substantial, assuming enough data is being compressed. In Shopzilla's case, Engineer estimated that its breakeven was somewhere around 30-50TB, or roughly 10% of the test and dev group's approximately 300TB. Users are looking at a six-figure investment to install redundant appliances and as such should target data compression at pools of storage large enough to deliver a payback (i.e., 50TB+).
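A back-of-the-envelope payback model can help set that threshold. The appliance cost and cost per usable TB below are hypothetical assumptions, not figures Engineer provided; plug in your own costs and observed compression savings.

```python
# Rough payback sketch: how big a storage pool must be before the capacity
# avoided by compression covers the cost of a redundant appliance pair.
def breakeven_tb(appliance_cost: float, cost_per_tb: float, saving: float) -> float:
    """Pool size (in TB) at which avoided capacity spend equals appliance cost."""
    return appliance_cost / (cost_per_tb * saving)


# Hypothetical example: $150,000 for redundant appliances, $6,000 per usable TB
# of primary NAS capacity, and a 50% compression saving.
print(f"breakeven at roughly {breakeven_tb(150_000, 6_000, 0.50):.0f} TB")  # ~50 TB
```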
Engineer's key advice to peers: once you've selected target candidates for primary data compression, plan carefully for the implementation. Specifically, users should take the time to understand the physical configuration of the network and revisit existing connections and how best to bridge to the IBM Real-time Compression device. Users should expect some disruption and plan accordingly.
Conclusions
On balance, the Wikibon community is encouraged by the IBM Real-time Compression solution and its impact on storage efficiency (50%+ capacity improvement), backup windows (a 25% reduction), and overall business value. In general, the consensus of the community was that the time for storage optimization is now and that technologies including data compression for primary storage should become standardized components of broader storage services offerings.
Over time, the Wikibon community believes data reduction technologies such as data de-duplication and primary storage compression will be increasingly embedded into vendor infrastructure portfolios. In the near term, however, compression technology as applied to primary storage and delivered as an appliance holds real promise.
Action Item: Insane storage growth and the economic crisis have hastened the drive to efficiency and storage optimization. Organizations still have not addressed the root problem, which is that they never get rid of enough data. Nonetheless, data reduction technologies have become increasingly widespread and eventually will be mainstream. Where appropriate (e.g., file-based NFS and CIFS environments with a small percentage of images, movies, and audio files), users should demand that storage be optimized using data reduction techniques. In-line compression appears to be the most logical choice for primary data storage requirements.