The three cardinal sins in storage are: 1) Causing data corruption; 2) Causing data loss; 3) Causing lack of data availability. Many vendors focus on #3, but few focus on #1 and #2. RAID, as a data protection mechanism, doesn’t inherently address #1.
Specifically, few storage vendors offer data integrity guarantees at the SLA level. Some (e.g. Hitachi, Xiotech and others) offer reliability/availability guarantees. The storage industry needs a new mindset that goes beyond the notion of drive reliability into data integrity.
Instead of relying on the disk to provide data integrity, the industry increasingly needs to implement data integrity and recovery in software. ZFS, for example, is a step in the right direction: it protects against silent data corruption by using data integrity checks. Some vendors, e.g. Cleversafe, provide an SLA that says not only that the data is available but also that the bits are right.
The traditional way of storing data is to store files and volumes together. If a backup is needed, a copy is taken and stored in another location, or the data is replicated over a network. The result is many copies of the same data. That is fine for transactional data, but it is a large overhead (~300%) for media files, archives, and large unstructured data sets (e.g., e-mail), which are large and rarely accessed.
In 2009, Cleversafe introduced the concept of dispersed storage, which uses a derivative of Reed-Solomon encoding to provide M of N fault tolerance and is positioned as the successor to RAID for petabyte-sized repositories of unstructured content. The data is broken up into slices (say 16) that are spread across multiple arrays in multiple locations. The Cleversafe algorithms then allow the data to be located and reassembled as required. If up to six of those sites are down or destroyed, 100% of the data can still be reassembled without loss. In addition, if the slices held at a single site are stolen, no data can be reconstructed from them.
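The M of N idea can be illustrated with a toy erasure code. The sketch below is not Cleversafe's algorithm: it is a minimal Reed-Solomon-style scheme built on polynomial interpolation over the prime field GF(257), and the names `disperse`, `reassemble`, and `_interpolate` are hypothetical. It assumes a threshold of 10 out of 16 slices, inferred from the "16 slices, up to six lost" example above. Slice symbols are kept as Python integers because a value in GF(257) can be 256, which does not fit in a byte. The first k slices hold the raw data bytes, so this sketch makes no attempt at the property that a stolen slice reveals nothing.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial through `points`, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def disperse(data: bytes, k: int, n: int) -> dict:
    """Encode data into n slices (keyed 1..n); any k of them can rebuild it."""
    padded = data + b"\0" * (-len(data) % k)
    slices = {x: [] for x in range(1, n + 1)}
    for off in range(0, len(padded), k):
        # Each group of k bytes defines a polynomial of degree < k.
        points = list(zip(range(1, k + 1), padded[off:off + k]))
        for x in slices:
            slices[x].append(_interpolate(points, x))
    return slices


def reassemble(available: dict, k: int, length: int) -> bytes:
    """Rebuild the original bytes from any k surviving slices."""
    if len(available) < k:
        raise ValueError("fewer than k slices available")
    chosen = sorted(available.items())[:k]
    out = bytearray()
    for pos in range(len(chosen[0][1])):
        points = [(x, symbols[pos]) for x, symbols in chosen]
        out.extend(_interpolate(points, x) for x in range(1, k + 1))
    return bytes(out[:length])


data = b"dispersed storage"
slices = disperse(data, k=10, n=16)
survivors = {x: s for x, s in slices.items() if x > 6}  # six sites lost
recovered = reassemble(survivors, k=10, length=len(data))
```

With these assumed parameters the raw expansion factor is n/k = 16/10 = 1.6 times the original size, which is the arithmetic behind comparing dispersal with keeping multiple full copies.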
To guarantee data integrity, Cleversafe’s storage nodes compute and store integrity check values for each slice they keep. The integrity values are proactively checked for correctness by a background process, meaning the system isn’t waiting for a read to discover an error. This is crucial for long-term retention and preservation of data.
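A background integrity pass of this kind might look like the following minimal sketch. The `SliceStore` class and its methods are hypothetical, and SHA-256 is an assumed choice of check value; the text does not say which function Cleversafe's nodes use.

```python
import hashlib


class SliceStore:
    """Hypothetical storage node that keeps a check value next to each slice."""

    def __init__(self):
        self.slices = {}  # slice name -> raw bytes
        self.checks = {}  # slice name -> stored integrity check value

    def put(self, name: str, data: bytes) -> None:
        self.slices[name] = data
        # Assumption: SHA-256 stands in for the real integrity check value.
        self.checks[name] = hashlib.sha256(data).hexdigest()

    def scrub(self) -> list:
        """One background pass: names of slices that no longer match their check."""
        return [
            name
            for name, data in self.slices.items()
            if hashlib.sha256(data).hexdigest() != self.checks[name]
        ]


store = SliceStore()
store.put("a", b"hello")
store.put("b", b"world")
clean_pass = store.scrub()

store.slices["b"] = b"w0rld"  # simulate silent corruption on disk
dirty_pass = store.scrub()
```

Running `scrub` on a schedule, rather than only on reads, is what lets damaged slices be found while enough healthy slices still exist to rebuild them.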
Additionally, the slice server will check the integrity of any requested slice prior to returning it to the client. If the slice is found to be invalid, the server will respond as if it does not have the slice, thereby preventing the corruption from propagating to a higher level. As a last line of defense, a data-source-level integrity check value is computed and compared by the client after it has reassembled a data source. The outcome: bad data will never reach the application or end-user.
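The two read-path checks can be sketched as follows. Everything here is illustrative: `SliceServer`, `read_source`, and `check_value` are hypothetical names, SHA-256 is an assumed check function, and plain concatenation stands in for the real reassembly step. A real client would fetch a replacement slice from another server instead of raising when one is missing.

```python
import hashlib
from typing import Optional


def check_value(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the real integrity check value.
    return hashlib.sha256(data).hexdigest()


class SliceServer:
    """Hypothetical slice server that verifies a slice before returning it."""

    def __init__(self):
        self._entries = {}  # slice name -> (raw bytes, stored check value)

    def put(self, name: str, data: bytes) -> None:
        self._entries[name] = (data, check_value(data))

    def get(self, name: str) -> Optional[bytes]:
        entry = self._entries.get(name)
        # A corrupt slice is reported exactly like a missing one.
        if entry is None or check_value(entry[0]) != entry[1]:
            return None
        return entry[0]


def read_source(server: SliceServer, names: list, expected: str) -> bytes:
    """Fetch slices, reassemble them, then verify the data-source-level check."""
    parts = []
    for name in names:
        part = server.get(name)
        if part is None:
            raise LookupError(f"slice {name!r} unavailable")
        parts.append(part)
    data = b"".join(parts)  # stand-in for the real reassembly step
    if check_value(data) != expected:
        raise ValueError("data source failed its integrity check")
    return data


server = SliceServer()
server.put("s0", b"hello ")
server.put("s1", b"world")
expected = check_value(b"hello world")
result = read_source(server, ["s0", "s1"], expected)
```

Treating a corrupt slice as absent keeps the client logic simple: it only has to handle one failure case, a slice it cannot get, and the final check catches anything that slipped through.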
As a result, Cleversafe is able to offer data integrity SLAs to its customers, since it can verify the integrity of data each time it is read. Our customers look to address all three of the cardinal sins of storage by leveraging Reed-Solomon M of N fault tolerance to replace RAID, and by leveraging integrity checks to address data corruption.
Action Item: As cloud computing and big data applications become more prevalent, the storage industry needs to move beyond the mindset of providing high availability into the realm of data integrity.