On Monday, June 7, 2010, Permabit announced Albireo, a new technology that speeds deduplication for primary storage. Albireo is positioned as an offering that enables storage array manufacturers to compete with NetApp Deduplication (formerly A-SIS, Advanced Single Instance Storage). NetApp has cornered the market on deduplication for primary storage by bundling deduplication into its arrays at no additional charge. This has increased the appeal of NetApp products and allowed the company to market storage efficiency effectively as a major theme.
Albireo differs in concept because it is delivered as a software development kit (SDK). OEMs must do some integration work, but in return they gain the flexibility of an all-software platform. Albireo is the first product in the primary deduplication space that promises to deliver the following:
- A flexible array of deduplication services,
- The ability to optimize file-, block- or object-based storage systems,
- Operation at the file or sub-file level,
- Inline (i.e. real-time) or post-process optimization,
- Elimination of application performance concerns,
- Scaling from a single storage controller to a cluster of storage controllers or a cluster of deduplication appliances communicating with a storage system over an industry-standard Ethernet interface.
How Does Albireo Work?
Figure 1 shows a diagram presented to the community by Jered Floyd, Permabit's co-founder and CTO. The diagram shows Albireo's parallel process:
- The green box in the center represents the vendor's existing storage stack. At the top of the box are the vendor's existing interfaces (e.g. block would be iSCSI or FC; file would be NFS or CIFS). Data flows from those interfaces to the existing file system stack, through the vendor's data placement and data protection scheme (e.g. RAID), and into the existing storage infrastructure stack. The key point is that Albireo doesn't interfere with the write or read path in any way. Once a vendor has integrated Albireo, its system takes in the data and copies it to the Albireo API along with internal placement information (e.g. a virtual LUN or an inode offset in a file system).
- Next on the diagram is Albireo's segmentation engine, which allows the system to identify boundaries within a file. The segmentation engine breaks larger objects into variable-sized chunks to improve dedupe efficiency. Note: in a block situation the vendor may simply choose a standard 4K block size.
- Once data is segmented, it is 'fingerprinted' using a known hash algorithm (SHA-256). The resulting fingerprint serves as an effectively unique identifier for determining whether the data already exists in the storage system.
- Next is the 'secret sauce' of Albireo, which addresses one of the most difficult challenges in primary deduplication: determining quickly whether the system has seen the information before across a large storage pool. Conventional hash tables only work well for small data sets. Across hundreds of terabytes or even multiple petabytes, a hash table requires too much overhead (e.g. memory) to maintain and must page to disk, which causes unacceptable performance. Albireo uses a proprietary indexing method to determine very quickly whether a data chunk has been seen before, working efficiently with data in memory and avoiding the need to go to disk. The index lookup occurs in 10 microseconds, and the end-to-end process with all latencies takes about 40 microseconds in total.
- Once the index lookup is complete, one of two things happens. If the information has not been seen before, it is added to the index. If it has been seen, the system makes an asynchronous callback to the vendor's system providing a 'duplicate advisory' or 'deduplication notification' (e.g. the data just stored in block X was previously stored in block Y, or file A at this offset is also stored at this location). It is then the responsibility of the storage vendor's stack to free the duplicate space and return it to the pool as free space. Permabit claims this is a 'lightweight' integration exercise requiring weeks or months of effort.
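To make the flow above concrete, here is a minimal Python sketch of the segment, fingerprint, lookup and advise steps. It is our own simplification and not Permabit's SDK: every name in it (`DedupeAdvisor`, `observe`, `naive_index_bytes`, the `(volume, offset)` location tuple) is hypothetical, segmentation is fixed-size 4K rather than variable-sized, the advisory callback is invoked synchronously rather than asynchronously, and a plain in-memory dictionary stands in for the proprietary index. The last helper estimates the size of a conventional fingerprint table, assuming 40 bytes per entry (a 32-byte SHA-256 digest plus an 8-byte location).

```python
import hashlib
from typing import Callable, Dict, Iterator, List, Tuple

BLOCK_SIZE = 4096  # fixed 4K blocks, the simple block-storage case

Location = Tuple[str, int]  # hypothetical placement info: (volume, byte offset)
Advisory = Callable[[Location, Location], None]


def segment(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, chunk) pairs using fixed-size segmentation."""
    for offset in range(0, len(data), block_size):
        yield offset, data[offset:offset + block_size]


def fingerprint(chunk: bytes) -> bytes:
    """Return the SHA-256 digest used as the chunk's identifier."""
    return hashlib.sha256(chunk).digest()


class DedupeAdvisor:
    """Receives copies of written data and reports duplicates via a callback."""

    def __init__(self, on_duplicate: Advisory) -> None:
        # A dict is only a stand-in; the real index is proprietary.
        self.index: Dict[bytes, Location] = {}
        self.on_duplicate = on_duplicate

    def observe(self, volume: str, base_offset: int, data: bytes) -> None:
        for offset, chunk in segment(data):
            fp = fingerprint(chunk)
            here = (volume, base_offset + offset)
            seen_at = self.index.get(fp)
            if seen_at is None:
                self.index[fp] = here            # first sighting: remember it
            elif seen_at != here:
                self.on_duplicate(here, seen_at)  # duplicate advisory


def naive_index_bytes(capacity_bytes: int,
                      block_size: int = BLOCK_SIZE,
                      entry_bytes: int = 40) -> int:
    """Bytes needed for one table entry per block (assumed 40 bytes each)."""
    return (capacity_bytes // block_size) * entry_bytes


# Usage: the third block repeats the first, so one advisory is raised.
advisories: List[Tuple[Location, Location]] = []
advisor = DedupeAdvisor(lambda new, old: advisories.append((new, old)))
advisor.observe("lun0", 0, b"A" * 4096 + b"B" * 4096 + b"A" * 4096)
print(advisories)
print(naive_index_bytes(2**50) / 2**40, "TiB for 1 PiB of unique 4K blocks")
```

Under these assumptions, a conventional table covering 1 PiB of unique 4K blocks needs 2^38 entries, or about 10 TiB before any hash-table overhead, which illustrates why such a structure cannot stay in memory at that scale. Note that the sketch only advises; freeing the duplicate block remains the job of the vendor's stack, as described above.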
Figure 2 shows the three deployment options for Albireo: inline, parallel, and post-process. The parallel and post-process implementations have no performance impact; inline deployment adds some latency to each request, which Permabit and its OEMs will attempt to mask with parallelism.
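The trade-off between inline and post-process deployment can be illustrated with a toy block store. This is a hedged sketch under our own assumptions, not a description of how any vendor integrates Albireo: `ToyStore` and its methods are hypothetical, a dictionary stands in for the index, and overwrites of existing addresses are not handled. The inline path performs the lookup before the write and skips the physical write for duplicates; the post-process path writes everything first and reclaims duplicates in a later scan.

```python
import hashlib
from typing import Dict


class ToyStore:
    """Hypothetical block store used only to contrast the two modes."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}  # address -> stored data
        self.refs: Dict[int, int] = {}      # address -> address it aliases
        self.index: Dict[bytes, int] = {}   # fingerprint -> canonical address
        self.physical_writes = 0

    def _store(self, addr: int, data: bytes) -> None:
        self.blocks[addr] = data
        self.physical_writes += 1

    def write_inline(self, addr: int, data: bytes) -> None:
        """Look up the fingerprint in the write path; skip duplicate writes."""
        fp = hashlib.sha256(data).digest()
        canonical = self.index.get(fp)
        if canonical is not None and canonical != addr:
            self.refs[addr] = canonical     # point at the existing copy
            return
        self._store(addr, data)
        self.index[fp] = addr

    def write_plain(self, addr: int, data: bytes) -> None:
        """Write with no lookup, as in a post-process deployment."""
        self._store(addr, data)

    def post_process(self) -> int:
        """Scan stored blocks, remap duplicates, return the number freed."""
        freed = 0
        for addr in sorted(self.blocks):    # sorted() copies the keys first
            fp = hashlib.sha256(self.blocks[addr]).digest()
            canonical = self.index.setdefault(fp, addr)
            if canonical != addr:
                self.refs[addr] = canonical
                del self.blocks[addr]       # space returns to the pool
                freed += 1
        return freed


payload = [b"A" * 4096, b"B" * 4096, b"A" * 4096]

inline = ToyStore()
for addr, data in enumerate(payload):
    inline.write_inline(addr, data)

deferred = ToyStore()
for addr, data in enumerate(payload):
    deferred.write_plain(addr, data)
freed = deferred.post_process()

print(inline.physical_writes, deferred.physical_writes, freed)
```

In this toy run the inline store performs two physical writes, while the post-process store performs three and frees one block afterwards. The inline path pays for a hash and lookup on every request, which is the latency the article refers to; the post-process path temporarily needs the extra capacity instead.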
Changing Storage Optimization Landscape
Storage optimization technologies in general, and data deduplication and compression in particular, are moving to primary storage. Key considerations as to where these technologies apply include use case, performance, costs, and effectiveness. Wikibon has developed the concept of CORE (Capacity Optimization Ratio Effectiveness) to assess the business value of optimization solutions. Our preliminary take on CORE as applied to primary storage optimization was published last quarter, where we concluded that for primary storage, speed is critical. We received significant feedback suggesting that: 1) CORE is highly dependent on use case and data type; 2) the methodology over-weights performance; and 3) we need to better account for read:write ratios in our assumptions.
Albireo has advanced our thinking on CORE in a number of ways, which were brought out in community comments. Specifically, the flexibility to support multiple use cases has clear value, and we have not explicitly accounted for that in CORE. Storwize, for example, scored very well in CORE because it has very low latency in primary storage use cases. However, Albireo brings other values that we must assess in addition to latency, not the least of which are the flexibility and embedded nature of the product. In addition, the costs of products like Albireo and NetApp Deduplication are 'fuzzy' because they may not be charged for explicitly. Our intent is to run Albireo through CORE once we have had an opportunity to further assess performance overheads, costs, likely deduplication ratios and use cases.
On balance, the Wikibon community agreed on the call that storage optimization function will increasingly become embedded as a feature of arrays. Albireo's ability to support block, file and object (essentially unified storage) is unique in the market and also adds business value to OEMs looking for flexibility. The keys to Albireo as we see them are: 1) the software-based model and 2) its proprietary indexing scheme. The next key for Permabit is to announce customers, which it expects to do in the second half of 2010.
Action Item: Primary storage optimization has been popularized by NetApp's Deduplication and has given the company a strategic advantage relative to other array suppliers, making deduplication a standard feature set of arrays. Products like Permabit's Albireo pivot off this trend and represent the next-generation deployment model for storage optimization in primary systems. CIOs should expect this capability to be a fundamental offering of primary storage systems and push vendors to demonstrate a roadmap where data reduction IP can reside throughout the storage stack without disruption.