Originating Author: David Vellante
Cloud computing practitioners led by Amazon, Facebook, Google, Zynga, and others have architected object-based storage solutions that typically do not involve traditional storage networks from enterprise suppliers such as EMC, NetApp, IBM, and HP. Rather, these Web giants are aggressively using commodity disk drives and writing software that applies new methods of data protection. This class of storage is designed to meet new requirements, specifically:
- Petabyte versus terabyte scale,
- Hundreds of millions versus thousands of users,
- Hundreds of thousands of network nodes versus thousands of servers,
- Self-healing versus backup and restore,
- Automated policy-based versus administrator controlled,
- Dirt cheap and simple to operate.
Traditionally, enterprises have justified a premium for RAID arrays and will often pay 10X or more on a per-GB basis relative to raw drive costs. Practitioners are finding that traditional array-based solutions cannot meet the cost and data integrity requirements for cloud computing generally and cloud archiving specifically. New methods of protecting data using erasure coding rather than replication are emerging that enable organizations to dramatically lower the premiums paid for enterprise-class storage, from a 10X delta to perhaps as low as 3X.
These were the conclusions of the Wikibon community based on a Peer Incite Research Meeting with Justin Stottlemyer, Director of Storage Architecture at Shutterfly. Specifically, rapidly increasing disk drive capacities are elongating RAID rebuild times and exposing organizations to data loss. New methods are emerging to store data that provide better data protection and dramatically lower costs, at scale.
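As a back-of-the-envelope illustration of where that premium narrows (a sketch with assumed layouts, not a Wikibon cost model), compare the raw-capacity overhead of replication with that of an N+M erasure code:

```python
def raw_capacity_multiplier(data_units, parity_units, copies=1):
    """Raw bytes consumed per byte of user data.
    copies=3 with no parity models triple replication;
    copies=1 with N data + M parity units models erasure coding."""
    return copies * (data_units + parity_units) / data_units

print(raw_capacity_multiplier(1, 0, copies=3))  # 3.0x   - triple replication
print(raw_capacity_multiplier(10, 6))           # 1.6x   - a 10+6 erasure code
print(raw_capacity_multiplier(8, 5))            # 1.625x - an 8+5 layout
```

Protection overhead is only one component of the 10X premium (controllers, software, and support make up the rest), but it shows how erasure coding on commodity drives pushes the multiple down.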
Shutterfly: Another Example of IT Consumerization
Shutterfly is a Web-based consumer publishing service that allows users to store digital images and print customized “photo books” and other stationery using these images. Shutterfly has more than $300M in annual revenues, roughly half of which is derived in the December holiday quarter, and a market value of around $2B (as of mid-2011). Its main competitors are services such as Snapfish (HP) and Kodak’s EasyShare.
A quick scan of Shutterfly’s Web site underscores its storage challenges. Specifically, the site offers:
- Free and unlimited picture storage,
- Perpetual archiving of these pictures – photos are never deleted,
- Secure image storage at full resolution.
Stottlemyer told the Wikibon audience that when he joined Shutterfly 18 months ago, the organization managed 19 petabytes of raw storage and created between 7 TB and 30 TB of new data daily. Costs were increasing dramatically, and drive failure rates were escalating. In that timeframe, Shutterfly had a near catastrophe when a 2PB array failed. The firm lost access to 172 drives, and while no data was lost, it needed three days to calculate parity and three weeks to restore dual parity across the entire system.
This catalyzed action, and after an extensive study of requirements and potential solutions, Shutterfly settled on an erasure code-based approach using a RAIN architecture (Redundant Arrays of Inexpensive Nodes). This strategy allows Shutterfly to leverage a commodity-style computing methodology and implement an N+M architecture (e.g., 10+6 or 8+5, versus a more limited N+1 RAID scheme).
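To make the N+M idea concrete, here is a toy, self-contained Reed-Solomon-style code over the prime field GF(257) (a teaching sketch, not Cleversafe's implementation): N data bytes define a polynomial, M extra evaluations of that polynomial become parity shares, and any N surviving shares reconstruct the data, so any M can fail simultaneously.

```python
P = 257  # a prime just above the byte range, so every byte value fits in GF(P)

def _lagrange_eval(points, x):
    """Evaluate the unique degree-(N-1) polynomial through `points` at x, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return total

def encode(data, m):
    """Return N data shares plus M parity shares as (x, value) pairs.
    The code is systematic: shares 0..N-1 are the data bytes themselves."""
    points = list(enumerate(data))
    parity = [(x, _lagrange_eval(points, x))
              for x in range(len(data), len(data) + m)]
    return points + parity

def decode(shares, n):
    """Reconstruct the N data bytes from ANY n surviving shares."""
    return [_lagrange_eval(shares[:n], x) for x in range(n)]

data = [115, 104, 117, 116]              # four data "drives" (bytes of "shut")
shares = encode(data, 3)                 # a 4+3 layout: tolerates any 3 losses
survivors = [shares[1], shares[4], shares[5], shares[6]]  # three shares lost
assert decode(survivors, 4) == data      # data recovered from any 4 shares
```

Real systems do this arithmetic over GF(2^8) with optimized matrix math, but the principle is the same: losing any M of N+M nodes leaves the data fully recoverable.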
The firm evaluated several alternatives, including open source projects from UC Santa Cruz, Tahoe-LAFS, Amplidata, and others, and chose a Cleversafe-based system using erasure coding and RAIN because it was the most mature and appeared best suited to Shutterfly’s needs. The Shutterfly system today takes in a write from a customer to a single array, which gets check-summed and validated through the application. The data is then written to a second array and similarly validated.
A key requirement for Shutterfly, and a key consideration for users, is the need to write software that can talk to multiple databases while protecting metadata and maintaining consistency. Shutterfly had to develop some secret sauce that maintains metadata in a traditional Oracle store as well as in the file system and provides access to that metadata from the application. This is not a storage problem per se; rather, it is an architectural and database issue.
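A minimal sketch of that ingest path, with hypothetical store and database interfaces (the names below are illustrative, not Shutterfly's actual APIs): the application checksums the image, writes and validates it on two independent arrays, then records the metadata in a relational store.

```python
import hashlib

class Store:
    """Stand-in for one storage array; a real system would use its API."""
    def __init__(self):
        self._objects = {}
    def put(self, key, data):
        self._objects[key] = bytes(data)
    def get(self, key):
        return self._objects[key]

def ingest_photo(photo, object_id, primary, secondary, metadata):
    checksum = hashlib.sha256(photo).hexdigest()
    for store in (primary, secondary):
        store.put(object_id, photo)
        # Read back and verify: the application, not the array,
        # owns end-to-end integrity in this design.
        if hashlib.sha256(store.get(object_id)).hexdigest() != checksum:
            raise IOError(f"checksum mismatch writing {object_id}")
    # Metadata lives in a traditional RDBMS (Oracle, in Shutterfly's case);
    # here a dict stands in for that table.
    metadata[object_id] = {"sha256": checksum,
                           "copies": ["primary", "secondary"]}

primary, secondary, metadata = Store(), Store(), {}
ingest_photo(b"...jpeg bytes...", "photo-0001", primary, secondary, metadata)
```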
Key Data Points
- Shutterfly has grown from 19 PB raw to 30 PB in the last 18 months.
- The company is seeing 40% growth in capacity per annum.
- The increase in image size is tracking at 25% annually.
- There is no end in sight to these growth rates.
- Shutterfly typically experiences 1-2 drive failures daily.
Some Metrics on RAID Rebuild Times
- For a RAID 6 array, rebuilding a 2TB drive can take anywhere from 50 hours for an idle array up to two weeks for a busy array.
- Each parity bit added at the back end increases system reliability by roughly 100X. Relative to dual-parity RAID 6, a 16+4 architecture carries two additional parities and so yields a 10,000X (100 × 100) increase in reliability, at commodity prices.
- Rebuild times for a 12+3 erasure-coded array, for example, start at 1-2 hours for an empty or idle array; for a full array, rebuild times can be longer than with a conventional array.
- However, the level of protection is greater, so the elongation in rebuild times is less stressful because data remains fully protected during the rebuild. With rebuild times lengthening as drive capacities grow, drive failures on traditional arrays are reaching the point where data loss is increasingly likely (a simple failure-probability sketch follows this list).
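A simple way to see why the extra parity matters so much is to compute the probability that enough additional drives fail during one rebuild window to cause data loss. The failure rates and rebuild windows below are assumptions for illustration, and the independent, constant-rate failure model is deliberately crude:

```python
from math import comb

def p_loss_during_rebuild(drives, parity_left, afr, rebuild_hours):
    """Probability that more than `parity_left` further drives fail during
    the rebuild, with independent failures at annualized rate `afr`."""
    p = afr * rebuild_hours / (365 * 24)  # per-drive failure prob in the window
    return sum(comb(drives, k) * p**k * (1 - p)**(drives - k)
               for k in range(parity_left + 1, drives + 1))

# RAID 6 after one failure: one parity left, ~100-hour busy rebuild (assumed).
print(p_loss_during_rebuild(drives=15, parity_left=1, afr=0.03, rebuild_hours=100))
# 12+3 after one failure: two parities left; even a 2x-longer rebuild is far safer.
print(p_loss_during_rebuild(drives=14, parity_left=2, afr=0.03, rebuild_hours=200))
```

With these assumed numbers, the erasure-coded layout is roughly two orders of magnitude less likely to lose data during a rebuild, despite the longer rebuild window.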
Shutterfly's primary driver was peace of mind during the drive-rebuild process, at a much lower cost than would have been achievable using traditional RAID techniques. If a drive fails today, Shutterfly has multiple other parity bits protecting the system, easing the pressure to recover quickly.
Object Storage and the Cloud
Organizations are increasingly under pressure to reduce costs, and in particular to cut capital expenditures. While the external cloud is alluring, organizations operating at scale, such as Shutterfly, are finding that building an internal cloud is often more cost-effective than renting capacity from, for example, Amazon S3. Rental is always more expensive than purchase at scale, but it offers considerably more flexibility, which is especially attractive for smaller organizations.
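A rough rent-versus-buy comparison makes the point; every number below is a hypothetical placeholder to be replaced with real quotes:

```python
def annual_cost_rented(tb, price_per_gb_month):
    """Annual cost of renting object storage capacity."""
    return tb * 1000 * price_per_gb_month * 12

def annual_cost_owned(tb, capex_per_tb, amortize_years, opex_ratio):
    """Straight-line amortized capex plus operations (power, space,
    staff) modeled as a fraction of the annual capex charge."""
    return tb * capex_per_tb / amortize_years * (1 + opex_ratio)

tb = 30_000  # roughly Shutterfly's current raw footprint
print(annual_cost_rented(tb, price_per_gb_month=0.10))   # ~$36M/yr at this rate
print(annual_cost_owned(tb, capex_per_tb=300,
                        amortize_years=4, opex_ratio=1.0))  # ~$4.5M/yr
```

At small scale the fixed costs of ownership dominate and rental's flexibility wins; at tens of petabytes the arithmetic flips.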
Another key issue practitioners will need to consider is the set of requirements object storage imposes. While object stores offer incredible flexibility, scale, and availability, they are new and, as such, harder to deploy. They also are not designed for versioning. While many small and mid-sized customers are using object storage today (through Amazon S3), development organizations at large companies may be entrenched and unwilling to make application changes in order to exploit the inherent benefits of object storage, which can include policy-based automation. Nonetheless, for firms scaling rapidly and looking toward cloud storage to meet requirements, the future of storage is object, because of its scaling attributes and inherently lower cost of protecting data. Traditional RAID is increasingly less attractive at cloud scale.
What are the Gotchas?
In point of fact, RAID 5 and RAID 6 use erasure coding, but with limited flexibility in the granularity of data protection. The Shutterfly case study shows that with erasure-coded data, a mean time to data loss on the order of 5 million years can be attained, i.e., astronomically low data-loss probabilities.
The downside of erasure coding such as the Reed-Solomon approach used by Cleversafe is that it is math-heavy and requires considerable system resources to manage. As such, practitioners need to architect different methods of managing data, with plenty of compute resource. The idea is to spread resources over multiple nodes, share virtually nothing across those nodes, and bet on Intel to increase performance over time. Generally, however, such systems are most appropriate for lower-performance applications, making archiving a perfect fit.
The Future of Erasure Coding
As cloud adoption increases, observers should expect erasure coding techniques to be used increasingly inside arrays, as opposed to just across nodes, and these approaches will begin to go mainstream within the next 2-5 years. In short, the Wikibon community believes that the trends toward cloud-scale computing and the move toward commodity compute and storage systems will catalyze adoption of erasure coding as a mainstream technology going forward. The result will be a more reliable and cost-effective storage infrastructure at greater scale.
Action Item: Increasingly, organizations are becoming aware of their exposure to data loss. Ironically, drive reliability became a non-factor in the 1980s and early 1990s due to RAID. However, as drive capacities increase, the rebuild times to recalculate parity data and recover from drive failures are becoming so onerous that the probability of losing data is rising to dangerous levels for many organizations. Moreover, the economics of protecting data with traditional RAID/replication approaches are becoming less attractive. As such, practitioners building internal storage clouds or leveraging public external data stores should consider organizing to exploit object-based storage and erasure coding techniques. Doing so requires defining requirements, choosing the right applications, and involving developers in the mix to fully exploit these emerging technologies.
Footnotes: Is RAID Dead - a Wikibon discussion on the issues.