A couple more thin provisioning caveats
From The Storage Anarchist, Wednesday, Nov 7, 8:30AM.
In addition to other caveats, customers considering thin provisioning should be aware of two oft-overlooked factors before deploying this technology:
- Performance: By increasing storage utilization, you are (by definition) placing more data on each spindle, and likely using fewer spindles to support the combined workloads sharing the devices that back the thinly provisioned capacity. Doubling utilization effectively doubles the access density and the spindle contention. Depending on the performance requirements and workloads of all the applications sharing the spindles, the response times and throughput of ALL applications may suffer because the spindles cannot support the higher workloads at reasonable response times.
This is precisely why the SPC-1 is irrelevant to the discussion of thin provisioning. All SPC-1 configurations leverage sparse allocation to attain the highest possible results - often using far less than 20% of the capacity on each spindle. Increasing utilization to a more cost-efficient 60% requires only 1/3 of the spindles, but given the added overhead of device contention, the effective SPC-1 IOPS and response times are likely to be far worse than even 1/3 the IOPS or 3x the response times of the published results. The SPC-1 does nothing to predict the relative performance of different storage devices under this sort of (more realistic) workload.
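The arithmetic here can be sketched in a few lines. This is purely illustrative - the dataset size, drive capacity, and workload figures below are assumptions, not benchmark data - but it shows how packing the same data and workload onto fewer, fuller spindles raises per-spindle access density roughly in proportion to utilization:

```python
import math

def spindles_needed(dataset_gb, drive_gb, utilization):
    """Spindles required to hold dataset_gb at a target utilization."""
    return math.ceil(dataset_gb / (drive_gb * utilization))

def iops_per_spindle(total_iops, spindles):
    """Average access density once the workload is spread over the pool."""
    return total_iops / spindles

dataset_gb = 10_000    # assumed total allocated data
drive_gb = 300         # assumed drive capacity
total_iops = 30_000    # assumed aggregate workload

sparse = spindles_needed(dataset_gb, drive_gb, 0.20)  # SPC-1-style sparse layout
dense = spindles_needed(dataset_gb, drive_gb, 0.60)   # cost-efficient thin pool

print(sparse, iops_per_spindle(total_iops, sparse))   # 167 spindles, ~180 IOPS each
print(dense, iops_per_spindle(total_iops, dense))     # 56 spindles, ~536 IOPS each
```

With these assumed figures, tripling utilization roughly triples the IOPS each surviving spindle must absorb - before even accounting for the seek-distance and queuing penalties of fuller drives.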
- Fault domains: Over-provisioning depends on multiple thinly provisioned devices (LUNs) sharing a common pool of spindles, and given the increased utilization, more LUNs will likely share those spindles than would with "fat" allocation. The first-order risk is fairly obvious: if any of the applications unexpectedly consumes all of the physical storage in the pool, ALL the dependent applications (LUNs) will have their writes rejected, potentially with serious consequences. Aggressive monitoring, plus the ability to respond to the implementation's alerts in a timely manner (by adding more storage), will mitigate this risk sufficiently for most.
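A minimal sketch of the kind of headroom monitoring that mitigates this first-order risk - all thresholds, pool sizes, and growth figures below are assumptions chosen for illustration, not vendor defaults. The idea is to alert not just on utilization, but on projected time-to-full, while there is still lead time to physically add storage:

```python
def days_until_full(pool_gb, consumed_gb, growth_gb_per_day):
    """Naive linear projection of when the pool runs out of physical space."""
    if growth_gb_per_day <= 0:
        return float("inf")
    return (pool_gb - consumed_gb) / growth_gb_per_day

def needs_alert(pool_gb, consumed_gb, growth_gb_per_day,
                utilization_threshold=0.80, lead_time_days=14):
    """Alert on high utilization OR projected exhaustion within the lead time."""
    utilization = consumed_gb / pool_gb
    return (utilization >= utilization_threshold or
            days_until_full(pool_gb, consumed_gb, growth_gb_per_day) <= lead_time_days)

# A 20 TB pool, 70% full, growing 150 GB/day: ~40 days of headroom left.
print(days_until_full(20_000, 14_000, 150))   # 40.0
print(needs_alert(20_000, 14_000, 150))       # False
print(needs_alert(20_000, 14_000, 500))       # True: full in 12 days
```

Real implementations would use the array's own alerting, of course - the point is that the lead time must be at least as long as your procurement-and-install cycle, or the alert is just a countdown to rejected writes.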
Data corruption of the storage pool is not so easily avoided, however. Although rare, blocks do occasionally get corrupted (as I've also discussed on my blog). Such corruptions are usually limited to only a few blocks, although they are often "silent" and may go unnoticed for years.
But there is a distinctly higher chance of a double-drive failure occurring in a RAID-5 group - a probability that increases with the size of the drives being used, and somewhat with the workload the RAID group is supporting (disk drives are mechanical and wear out faster under heavy loads). The challenge is that such a double-drive failure can result in the loss of nearly two full drives' worth of data blocks (subtracting out the parity overhead), 60% or more of which will be real data (due to the increased utilization under thin provisioning). These data blocks will probably be irrecoverably lost - and (by definition) this means that every single LUN sharing the pool will suffer at least some data loss (if not total destruction, should significant portions of the layout metadata be lost). And as noted, the number of LUNs impacted will likely be significantly greater than with "fat" provisioning.
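Why bigger drives make this worse can be shown with a rough exposure-window model - every parameter below (rebuild rate, group size, MTBF) is an assumption for illustration, not a measured figure, and real MTBF drops under the heavy load noted above. Rebuild time scales with drive capacity, and any of the surviving drives failing inside that window loses the group:

```python
def rebuild_hours(drive_gb, rebuild_mb_per_sec=50):
    """Time to reconstruct one failed drive at an assumed rebuild rate."""
    return drive_gb * 1024 / rebuild_mb_per_sec / 3600

def p_second_failure(drive_gb, group_size=8, mtbf_hours=1_000_000):
    """Probability another drive in the group fails within the rebuild window
    (simple independent-failure approximation)."""
    window = rebuild_hours(drive_gb)
    return 1 - (1 - window / mtbf_hours) ** (group_size - 1)

# Larger drives => longer rebuilds => wider exposure window => higher risk.
for gb in (146, 300, 750):
    print(gb, p_second_failure(gb))
```

The absolute numbers are only as good as the assumptions, but the trend is the point: the double-failure probability grows linearly with drive capacity, and thin provisioning concentrates more LUNs behind each such exposure.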
Within most recent-generation arrays, like the USP-V or the DMX, customers will most likely be advised (by vendors and best practices) to use RAID-6 for all the RAID sets backing thin provisioned pools, as this helps minimize the probability of data loss (RAID-6 can tolerate the loss of two drives and still maintain the integrity of the data). Note that while this approach mitigates most of the risk of data loss, RAID-6 may have an additional impact on the performance of thin devices.
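Extending the same rough exposure-window reasoning shows why RAID-6 is the advice here - again, the window length, group size, and MTBF below are assumptions, and the independence assumption is generous to both schemes. RAID-6 survives a second failure during rebuild, so losing data requires a third failure in the same window, which is orders of magnitude less likely:

```python
from math import comb

def p_concurrent_failures(extra_failures, group_size=8,
                          rebuild_window_hours=4.0, mtbf_hours=1_000_000):
    """Rough probability that `extra_failures` more drives fail during one
    rebuild window (naive independent-failure estimate)."""
    p_one = rebuild_window_hours / mtbf_hours   # per-drive chance in the window
    survivors = group_size - 1
    return comb(survivors, extra_failures) * p_one ** extra_failures

raid5_loss = p_concurrent_failures(1)   # one more failure kills RAID-5
raid6_loss = p_concurrent_failures(2)   # RAID-6 needs two more failures
print(raid5_loss, raid6_loss, raid5_loss / raid6_loss)
```

The extra parity write per stripe is the price - which is the performance impact on thin devices noted above.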
But if you were to leverage the newly-announced USP-V capability of thinly provisioning externally-virtualized storage, RAID-6 is very likely NOT to be an option, because most older storage arrays do not offer RAID-6. Importantly, in this configuration the USP-V hardware and software cannot do anything to protect the data from corruption in the external storage, aside from perhaps mirroring the volumes internally (and it is not clear whether that is even a supported option).
It's also important to understand that this risk is not limited to thin provisioning - it is probably why Hitachi & HP documentation recommends that externally virtualized storage not be used for heavy I/O workloads other than backup or archive. Users should demand access to, and carefully consider, the recommended use cases for thinly provisioned external devices in light of this risk.
Action Item: Thinly provisioned external storage should be considered very carefully before it is committed to as an infrastructure-wide strategy. While some will view these concerns as spreading fear, uncertainty and doubt (FUD), buyers should explore this topic with all potential suppliers and do their own homework before dismissing its significance.