Over the past few years, RAID 6 has grown in popularity and has become a ‘must-have’ feature when purchasing a RAID controller. This article explores some of the causes behind this booming interest, justifies the reasons and dispels the myths, presents models and implementations (each with its advantages and disadvantages), and sets the framework for an analytical approach to the problem and its solutions.
At the most basic level, RAID is an association of disks and related data layouts designed to survive read errors on some of its components while still allowing the system to retrieve the data. RAID 0, the basic striping model, provides no redundancy: it optimizes performance, but offers no recovery when data is lost. RAID 5, designed to recover from a single data failure, accomplishes this by adding one redundant check disk (“P disk” or “Parity disk”), calculated simply as the XOR of the peer data. In mathematical terms, this is a linear equation with only one unknown (the data missing because of the read failure), and it can be solved like any linear equation using elementary algebra.
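The XOR relationship described above can be illustrated with a minimal Python sketch. The three-byte blocks and the helper name xor_blocks are made up for illustration; a real controller works on whole stripes and rotates parity across disks, which is not shown here.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks, one per data disk.
data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]

# The P block is the XOR of all data blocks.
parity = xor_blocks(data)

# Suppose the block on disk 1 cannot be read: XOR the survivors with P.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered == data[1])
```

Because XOR is its own inverse, the same routine both generates the parity and solves for the single missing block.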
RAID 6 extends RAID 5 to recover from two data errors in the same data set. From a mathematical point of view, RAID 5 needs a single equation to resolve a single unknown, while RAID 6 needs a system of two linearly independent equations to recover two unknowns. The first equation can be the same as in RAID 5, yielding the P disk. The second equation needs to be different and yields a Q disk, which gives the scheme its “P+Q” nickname.
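As an illustration of what a second, independent equation can look like, here is a minimal Python sketch. It assumes one widely used construction that the text above does not specify: Q is a weighted sum over the Galois field GF(2^8), with reduction polynomial 0x11D and generator 2. The function names are hypothetical, and each "disk" holds a single byte to keep the example short.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) (assumed polynomial 0x11D)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8)."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """Multiplicative inverse: a^254, since a^255 == 1 for nonzero a."""
    return gf_pow(a, 254)

def compute_pq(data):
    """P is the plain XOR; Q weights disk i by 2^i before XOR-ing."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return p, q

def recover_two(data, x, y, p, q):
    """Solve the two equations for the bytes lost on data disks x and y."""
    pxy, qxy = p, q
    for i, d in enumerate(data):
        if i in (x, y):
            continue  # unreadable disks contribute nothing
        pxy ^= d
        qxy ^= gf_mul(gf_pow(2, i), d)
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    dx = gf_mul(qxy ^ gf_mul(gy, pxy), gf_inv(gx ^ gy))
    dy = pxy ^ dx
    return dx, dy

data = [0x11, 0x22, 0x33, 0x44]
p, q = compute_pq(data)
damaged = [0x11, None, 0x33, None]   # disks 1 and 3 are lost
print(recover_two(damaged, 1, 3, p, q) == (0x22, 0x44))
```

The weights 2^i differ for every disk, which is what keeps the Q equation independent of the P equation and makes the two-unknown system solvable.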
In theory, this progression could continue indefinitely, creating any type of M+N redundancy, but real-life applications limit interest to N=2, i.e. two concurrent and independent failures on any data stripe, which is the main target of RAID 6.
Why RAID 6?
There are two cases in which data cannot be retrieved from the disks, and RAID 5 addresses each of them independently:
• One disk is dead, i.e. it does not respond to any read/write command and, as such, needs to be replaced. RAID 5 recovers all data from the surviving peers and rebuilds the defective disk.
• The disks are OK, but there is a bad block (i.e. a block that cannot be read) on one disk, so that specific data cannot be read directly and must be reconstructed from the peers.
On paper, each disk has an MTBF of about 500,000 to 1.5 million hours, i.e. roughly 50 to 150 years between “dead disk” situations. In real life, less than optimal working conditions, and most of all the thermal and mechanical environment, may reduce this number by a full order of magnitude. Given that each disk has an independent life and any of them could fail, an array of “N” disks will statistically fail “N” times more frequently than a single disk. Combining the two, if the array contains a reasonable number of disks, and these are perhaps in the low MTBF range, it becomes likely that a disk failure will be experienced within the expected life of the array (one every several months to a few years).
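The scaling argument above can be checked with a few lines of Python. This is only a sketch of the paragraph's reasoning: the tenfold derating factor and the 10-disk array are example inputs, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365

def array_mtbf_hours(disk_mtbf_hours, n_disks):
    """Mean time between disk failures for an array of independent disks."""
    return disk_mtbf_hours / n_disks

def hours_to_years(hours):
    return hours / HOURS_PER_YEAR

# Low end of the quoted range, derated by one order of magnitude
# for poor thermal and mechanical conditions (example values).
derated_disk_mtbf = 500_000 / 10
array_hours = array_mtbf_hours(derated_disk_mtbf, n_disks=10)
print(f"{array_hours:.0f} hours = {hours_to_years(array_hours):.2f} years")
```

With these inputs the array sees a dead disk every 5,000 hours, a bit under seven months, which is consistent with the "several months to a few years" estimate.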
What are the chances that two disks will die at the same time (“same time” meaning a second death before the rebuild completes)? Given that the RAID 5 array MTBF is proportional to MTBF^2, we are talking about an occurrence on the order of once every 10,000+ years (roughly 10^8 hours or more), which makes it quite unlikely no matter the working conditions. This is a mathematically relevant issue, but not a real-life case worth worrying about. It should not be completely discarded, however, as double disk failures do happen at times, even though the root cause is then completely unrelated to MTBF.
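To put a rough number on the double-failure case, the sketch below uses the common textbook approximation MTBF^2 / (N * (N-1) * MTTR) for the mean time to a second failure during rebuild. That formula, the 24-hour rebuild time, and the function name are assumptions made for illustration; the article only states the proportionality to MTBF^2.

```python
HOURS_PER_YEAR = 24 * 365

def double_failure_mtbf_hours(disk_mtbf_hours, n_disks, rebuild_hours):
    """Approximate mean time until a second disk dies during a rebuild.

    Assumed model: independent failures, MTBF^2 / (N * (N - 1) * MTTR).
    """
    return disk_mtbf_hours ** 2 / (n_disks * (n_disks - 1) * rebuild_hours)

# Example inputs: 1,000,000-hour disks, 10-disk array, 24-hour rebuild.
hours = double_failure_mtbf_hours(1_000_000, 10, 24)
print(f"{hours:.3g} hours = {hours / HOURS_PER_YEAR:,.0f} years")
```

Under these example inputs the result is several hundred million hours, i.e. tens of thousands of years, in line with the 10,000+ year order of magnitude quoted above.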
A read error (an unrecoverable ECC read error) is a very subtle phenomenon, described statistically as an event whose frequency is proportional to the number of bits read. For SCSI/FC/SAS disks (SAS is the focus of this article, but the same qualitative discussion applies to all three technologies) this is one error every 10^15 bits read (~100 TB) or every 10^16 bits read (~1,000 TB or 1 PB). This is called the BER, or Bit Error Rate.
SATA disks, however, are much more error prone, with a BER one or two orders of magnitude higher (one error every 10^14 to 10^15 bits read, or every 10 to 100 TB, depending on the disk design). If a SATA disk holds 1 TB and someone reads it completely 10 times, they can expect to find a new defective block (assuming the disk is rated at one read error every 10^14 bits read). Two defective blocks at corresponding positions in the same stripe is almost impossible, of the order of one event every 10^30 bits read.
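The conversions between bits read and terabytes used above are easy to reproduce. The sketch below uses decimal terabytes (10^12 bytes); the function names are hypothetical.

```python
def bits_to_terabytes(bits):
    """Convert a number of bits to decimal terabytes (10^12 bytes)."""
    return bits / 8 / 10**12

def full_reads_per_error(bits_per_error, disk_bytes):
    """How many complete reads of one disk fit between two read errors."""
    return bits_per_error / 8 / disk_bytes

for exponent in (14, 15, 16):
    tb = bits_to_terabytes(10**exponent)
    print(f"one error per 10^{exponent} bits = one error per {tb} TB read")

# A 1 TB SATA disk rated at one error every 10^14 bits.
print(full_reads_per_error(10**14, 10**12))
```

The exact figures are 12.5 TB, 125 TB and 1,250 TB, which the article rounds to 10, 100 and 1,000 TB; the 1 TB disk yields 12.5 full reads per expected error, i.e. "about 10".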
However, what is the chance of two errors, one due to MTBF and one due to a read error? Assume we have an array of 10 SAS disks, 300 GB each, with a BER of 10^-15: what are the chances that we encounter a read error during a rebuild? It is one every 10^15 bits read * 1/8 (bytes/bit) * 1/10 (disks) * 1/300 GB ≈ once every 42 rebuilds, or roughly one rebuild in 50. Significant, but not astonishing: an array that dies 50 times in its life has bigger problems to worry about! However, this is not statistically insignificant, because it can be read in a different way: if you sell 50 arrays like the one above, at least one of them will likely experience the problem. That is still small, but customers with several hundred installations will be concerned. Also, some SAS disks have a BER 10 times better, which makes this problem minimal.
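The arithmetic in this paragraph can be written out as a short Python sketch. It follows the article's simplification of counting all 10 disks as read during a rebuild; the function name is hypothetical.

```python
def rebuilds_per_read_error(bits_per_error, n_disks, disk_bytes):
    """Expected number of rebuilds between two unrecoverable read errors."""
    bytes_per_error = bits_per_error / 8
    bytes_read_per_rebuild = n_disks * disk_bytes
    return bytes_per_error / bytes_read_per_rebuild

GB = 10**9

# 10 SAS disks of 300 GB, one read error every 10^15 bits.
sas = rebuilds_per_read_error(10**15, n_disks=10, disk_bytes=300 * GB)
print(f"SAS, 10^15: one read error every {sas:.1f} rebuilds")

# Same array with the better SAS rating of one error every 10^16 bits.
sas_better = rebuilds_per_read_error(10**16, n_disks=10, disk_bytes=300 * GB)
print(f"SAS, 10^16: one read error every {sas_better:.1f} rebuilds")
```

The first case gives about 41.7 rebuilds per error, which the text rounds to roughly 50; the better-rated disks push this out by a factor of 10.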
How about SATA? The disks are larger, which makes the problem worse, and their error rate is higher, which makes it worse still. Assume the same array of 10 disks as above, but with 500 GB disks and a 10^-14 read error rate: 10^14 bits * 1/8 (bytes/bit) * 1/10 (disks) * 1/500 GB = once every 2.5 rebuilds. This is something! And we’re talking about a 5 TB array that, if not very common, is well within reach of current technology. This means that, for an array of this size, roughly one rebuild in every 2.5 will lose a block, and an error message like “Read Error at LBA = 0xF43E1AC9” doesn’t help much. What is at 0xF43E1AC9? Empty space? Kernel data that will blue screen at next boot? Libraries that nobody uses? My bank account? There is no way to know, and the only way out is to restore the data from backup, which will take a very long time and… what are the chances that you have a read error from tape while restoring 5 TB of data? Much higher than from disk, so this becomes an endless problem.
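The same comparison can also be expressed as a probability per rebuild rather than a count of rebuilds. The sketch below assumes read errors are independent and rare, so the chance of at least one error is 1 - exp(-expected errors); that Poisson model and the function name are assumptions added for illustration, and all 10 disks are counted as read, as in the article's arithmetic.

```python
import math

def rebuild_error_probability(bits_per_error, n_disks, disk_bytes):
    """Probability of at least one unrecoverable read error in one rebuild."""
    bits_read = n_disks * disk_bytes * 8
    expected_errors = bits_read / bits_per_error
    return 1 - math.exp(-expected_errors)

GB = 10**9

p_sas = rebuild_error_probability(10**15, n_disks=10, disk_bytes=300 * GB)
p_sata = rebuild_error_probability(10**14, n_disks=10, disk_bytes=500 * GB)
print(f"SAS  (10 x 300 GB, 10^15): {p_sas:.1%} per rebuild")
print(f"SATA (10 x 500 GB, 10^14): {p_sata:.1%} per rebuild")
```

Under this model the SATA array has roughly a one-in-three chance of hitting a read error during a rebuild, against a few percent for the SAS array, which matches the order-of-magnitude gap between "once every 2.5 rebuilds" and "once every 50".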
Action Item: RAID 6 is important not so much to recover from two disk failures as to recover from one disk failure plus a single read error on a surviving peer disk. The chances of this happening grow as disk sizes increase but, most of all, the prevalence of low-quality SATA disks has boosted them by a factor of 10 or 100. It all ends in a trade-off between price and the risk one is willing to take: SAS and high-end drives have much better MTBF and BER, so they make this issue much smaller (though not impossible), while low-end SATA drives deliver instant CAPEX savings at purchase but open a much larger window for a double failure, and thus create a market for RAID 6.
It should be noted that some of the risks listed above can be attenuated using other techniques, such as Patrol Read, in which the disks are scanned periodically to find bad blocks before another disk failure turns them into a double failure in that stripe. At present, RAID 6 seems a much more suitable, and possibly necessary, solution for SATA drives, but one cannot rule out its use with SAS disks if capacity and array size keep growing according to market predictions.