Written with David Floyer
Senior storage managers are witnessing an unprecedented surge of innovation that is causing major dislocations to their strategic plans. By 2015, the tried-and-true infrastructures of the past twenty years will be flipped upside down and on their sides, leaving in their wake an entirely new approach to managing storage.
At the core of these changes is the increasing granularity of the storage hierarchy. Specifically, after two decades of function moving toward the SAN, function is now moving back up the stack toward the server, and the hierarchy itself is becoming more finely grained.
An important discussion going on in the community is how best to exploit this hierarchy in a world that is operating at Web scale and bringing new levels of complexity to IT organizations. The clear answer is automation. What is less clear, when trying to guarantee a level of service for an application, is whether it is better to move data throughout the stack to optimize both cost and performance, or to avoid movement as much as possible and instead optimize access to performance. In other words, is the concept of storage tiering as we know it (i.e. moving data to the lowest cost-per-bit device that will meet the performance requirements) the best approach, or should IT organizations act more like next-generation cloud service providers and think about IO Tiering?
The Big Storage Disruption
There are eight major storage disruptors the Wikibon community is following:
- Flash – driven by mobile communications and consumer markets, flash storage holds the promise of addressing the "horrible spinning disk" that has been the Achilles' heel of system design for the past five decades. With flash, data that needs lots of IOs can be serviced from a persistent device with access times that deliver orders of magnitude better performance than spinning hard drives.
- Huge, inexpensive SATA disk drives – at the same time, we're seeing continued massive growth in the amount of data under a single disk drive actuator (e.g. 10TB drives are on the horizon) while the bandwidth into these devices remains relatively poor. As a result, the disparity between tier 1 and tier N storage is growing dramatically.
- Erasure coding – this (and other similar technologies) addresses the major drawbacks of RAID protection generally, and onerous rebuild times specifically. Erasure coding, popularized by the likes of Cleversafe, offers the ability to store archived data at very low cost with much faster recovery times than RAID (a simple overhead sketch follows this list).
- The cloud – and the availability of object stores with pay-by-the-drink pricing and RESTful APIs that change the way organizations provision, manage and charge for storage.
- Volume managers that can manage volumes throughout the storage stack – a volume is a logical chunk of storage as seen by the application; it can be a collection of files, blocks or objects. Modern volume managers can span the entire data center infrastructure, from the server all the way to the cloud, and move data from DAS to SAN to cloud storage dynamically.
- Virtualization – changes the requirements for storage by providing a storage infrastructure that can be tied to the application, specifically by allowing a virtual machine to be dedicated to an application. Virtualization provides a different type of storage subsystem (i.e. a VMDK) that can be tied directly to specific applications and moved dynamically around the network (using volume DR managers). It also provides a management framework for monitoring and allocating IO resources, and it completely changes the way we think about storage management, data protection, data movement and the like.
- Automated tiering software that can automate the movement of data throughout the storage hierarchy. The drawback of tiering today is that it works only within the array and has limited scope. Initiatives such as EMC's Project Lightning are expanding the scope of tiering software into the server. Presumably the cloud is on the horizon, but this would be a mammoth extension because it would require general acceptance of a volume manager that speaks the languages of memory, traditional storage and the cloud.
- Cloud service providers are becoming increasingly important, both as suppliers to IT organizations and as a benchmark model for traditional IT shops. CSPs, as they are known, manage IT operations for profit, whereas traditional IT has always been seen as a cost center. The notion that IT is increasingly becoming a revenue-generating center is underscored by CSPs as well as the large Internet giants.
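Returning to the erasure-coding point above, here is a minimal sketch of the economics, assuming a generic k-of-n scheme. The fragment counts and the comparison figures are illustrative assumptions only, not Cleversafe's (or anyone else's) actual configuration.

```python
# A minimal sketch of erasure-coding economics. The fragment counts and the
# comparison figures below are illustrative assumptions, not vendor specs.

def erasure_overhead(k: int, m: int) -> dict:
    """For a k+m scheme, any k of the n = k+m fragments can rebuild the data."""
    n = k + m
    return {
        "fragments_total": n,
        "failures_tolerated": m,      # up to m fragments can be lost
        "storage_overhead": n / k,    # raw bytes stored per usable byte
    }

# Example: 12 data fragments plus 4 parity fragments spread across drives/nodes.
print(erasure_overhead(k=12, m=4))
# -> 16 fragments, tolerates 4 simultaneous failures, ~1.33x storage overhead.
# Compare with triple replication (tolerates 2 losses at 3.0x overhead) or
# RAID-6 (tolerates 2 drive losses but must rebuild entire drives).
```

The faster recovery comes from rebuilding only the affected fragments, spread across many nodes, rather than reconstructing an entire drive behind a single controller.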
Shifting Storage Requirements
Wikibon community practitioners tell us their main storage needs include:
- Data should be automatically migrated to the most cost-effective storage hardware while at the same time providing the protection and performance required by the applications accessing the data (a simple policy sketch follows this list);
- The infrastructure must provide the management controls necessary to provision storage and monitor availability, performance and security;
- Full automation of the storage solution across the stack.
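These requirements are essentially the classic tiering problem described earlier: place data on the lowest cost-per-bit device that meets its performance needs. Below is a minimal policy sketch; the tier names, prices and performance figures are purely illustrative assumptions.

```python
# A minimal sketch of classic capacity tiering: place each volume on the
# cheapest tier that still meets its performance requirement. All tier names,
# prices and performance figures below are illustrative assumptions.

TIERS = [
    # (name, cost in $/GB/month, delivered IOs per TB per second)
    ("server flash",  2.00, 200_000),
    ("tier 1 array",  0.80,   2_000),
    ("SATA tier 2",   0.25,     150),
    ("cloud archive", 0.03,      20),
]

def place_volume(required_ios_per_tb: float) -> str:
    """Return the lowest-cost tier that meets the required access density."""
    candidates = [t for t in TIERS if t[2] >= required_ios_per_tb]
    if not candidates:
        raise ValueError("no tier can satisfy this performance requirement")
    return min(candidates, key=lambda t: t[1])[0]

print(place_volume(1_500))   # -> 'tier 1 array'
print(place_volume(10))      # -> 'cloud archive'
```

The debate that follows is essentially about where this placement logic should live: in a volume manager, in the array, or replaced entirely by capping IO rather than moving data.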
In the good old days of nothing but spinning disk drives, life was pretty simple. Drives could deliver 100-200 IOs per second, and there were a few choices of spin speed (e.g. 15K or 7.2K RPM devices). Capacities ranged from roughly 400GB to 2TB, a span of 4 or 5:1 in the capacity/performance delta across the spectrum. In other words, the variation across the storage stack was not huge. The optimization metric was cost/bit, and the tradeoff was clear: performance versus capacity.
Storage today is typically allocated assuming a worst-case scenario. In other words, at the start of a project you assume the worst case and over-provision both capacity and performance. Movement of data then becomes expensive, difficult and risky, so people try not to move data except when they have to. It is much easier and safer to just buy more storage and keep data where it is.
In the 2006-2007 timeframe, organizations took a brute-force approach to optimizing storage, using storage virtualization (e.g. IBM's SVC) to create a default tier 2. The idea was to avoid tier 1 where possible, create a standardized tier 2 configuration that was the automatic default for the business, and force business heads to make a case if they wanted expensive tier 1 storage. While this worked and could automate data movement for any attached devices, it didn't scale well.
Storage federation extends this notion and solves other migration problems, but it is largely confined to a narrow scope, e.g., a certain set of arrays from a certain set of manufacturers, with control limited to that specific storage. For example, this approach wouldn't control flash devices in the server (e.g. Fusion-io), and it doesn't control storage going to the cloud.
What's changed? Today the spectrum of IO density is far wider (and the top end far more expensive), ranging from flash devices operating in microseconds down to distributed cloud storage operating in seconds. The IO density (or access density) metric is IOs per terabyte per second, and its range has become huge: flash has access times (latency) measured in microseconds and access density in the hundreds of thousands of IOs/TB/sec. Compare that with the cloud, where access times are measured in seconds and access density is measured in the tens of IOs per terabyte. This enormous disparity means that managing storage becomes much more cost sensitive; allocation mistakes are far more expensive than ever before. In addition, as data patterns and application requirements change, the movement of data has to be automated because of the overwhelming complexity involved.
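The access density arithmetic itself is simple, and working it through for a few representative devices shows how wide the spread has become. A minimal sketch follows; the device figures are rough, illustrative assumptions consistent with the ranges above.

```python
# Access density = IOs per second divided by usable capacity in terabytes.
# The device figures are rough, illustrative assumptions for comparison only.

def access_density(iops: float, capacity_tb: float) -> float:
    """IOs per terabyte per second for a device or service."""
    return iops / capacity_tb

devices = {
    "PCIe flash card":    access_density(iops=200_000, capacity_tb=1.0),
    "15K FC/SAS drive":   access_density(iops=200,     capacity_tb=0.4),
    "7.2K SATA drive":    access_density(iops=100,     capacity_tb=2.0),
    "cloud object store": access_density(iops=100,     capacity_tb=10.0),
}

for name, density in devices.items():
    print(f"{name:20} {density:>10,.0f} IOs/TB/sec")
# Flash lands in the hundreds of thousands of IOs/TB/sec; the cloud tier in the tens.
```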
To Tier or Not to Tier
The crux of the debate: which technologies are necessary to optimize cost, lower risk and manage storage effectively? Where should the locus of control reside, and where should you place your bets as a practitioner? In other words, what's the right tiering approach?
We've identified three scenarios for managing next-generation storage:
- Volume management-centric (e.g. Nimbula) – volumes are defined and an automated volume-movement system manages them based on the cost, availability, security and performance requirements of the applications. There are two main components here – a volume manager and a volume mover. Importantly, this approach spans the entire spectrum, from PCIe flash all the way into the cloud. Most of these volume management systems operate at the whole-volume level; extensions to operate at a sub-LUN level are required.
- Array-centric (e.g. EMC FAST VP) – rather than moving whole volumes, you move pieces of volumes (i.e. blocks within volumes), promoting hot data up the stack and providing an inside-the-data-center solution. With Project Lightning (for example), this notion gets extended into the server. Again, much more work needs to be done for this to reach storage in the cloud.
- IO Tiering (not data-movement-based tiering) (e.g. SolidFire) – this approach focuses on IO as the control point. The idea is to place all active data in compressed, deduplicated flash storage, making it cost competitive with today's tier 1 block-based arrays. Access to data is then controlled, meaning you specify the IOPS, latency and bandwidth allocated to each target volume. In this scenario the maximum underlying performance is fixed – i.e. all the data can have the same performance without being moved through the stack. This works by capping the number of IOs allowed for each volume (or other metrics such as gigabytes transferred or guaranteed latency) and charging for that performance accordingly. This approach is increasingly a preference of cloud service providers because they want a simple way to charge. They have no vested interest in moving data to a tier 2/tier 3 solution, even though it would cut their internal costs; CSPs find this approach simple, like a gold/silver/bronze/nickel schema. Figure 1 in a posting defining IO Tiering gives a simple example. The service level is guaranteed and capped with a clear price. For example, the top (gold) tier gets a maximum of 5,000 IOs/TB/sec, the silver tier 1,000, the bronze tier 500, and the lowest (nickel) tier 50 IOs/TB/sec. Note that the lowest tier provides less IO capability than a SATA disk, but it can still be justified because of its low cost for data that is rarely accessed. From the service provider's point of view, the sum of the guarantees can be significantly greater than the total performance capability of the storage, with almost zero probability of failing to meet performance guarantees, because the underlying media always has the potential to perform faster (a minimal arithmetic sketch follows this list).
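Here is the minimal arithmetic sketch referenced above. The tier caps mirror the gold/silver/bronze/nickel example; the volume mix and the array's aggregate capability are illustrative assumptions.

```python
# A minimal sketch of IO tiering: service levels are expressed as capped
# IOs/TB/sec per volume rather than by moving data between media. Tier caps
# follow the example above; the volume mix and the array's aggregate
# capability are illustrative assumptions.

TIER_CAPS = {        # guaranteed (and capped) IOs per TB per second
    "gold":   5_000,
    "silver": 1_000,
    "bronze":   500,
    "nickel":    50,
}

volumes = [          # (volume, tier, size in TB)
    ("oltp-db",   "gold",    10),
    ("analytics", "silver",  40),
    ("home-dirs", "bronze", 100),
    ("archive",   "nickel", 400),
]

def guaranteed_iops(tier: str, size_tb: float) -> float:
    """A volume's IOPS guarantee scales with its size at the tier's cap."""
    return TIER_CAPS[tier] * size_tb

total_guaranteed = sum(guaranteed_iops(tier, size) for _, tier, size in volumes)
array_capability = 100_000      # assumed aggregate IOPS of the flash array

print(f"sum of guarantees: {total_guaranteed:,.0f} IOPS")                 # 160,000 IOPS
print(f"array capability : {array_capability:,.0f} IOPS")                 # 100,000 IOPS
print(f"oversubscription : {total_guaranteed / array_capability:.2f}x")   # 1.60x
```

As noted above, the provider can oversubscribe the sum of the guarantees because the underlying flash can always burst well beyond any single volume's cap.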
These approaches are not necessarily mutually exclusive. For example, you could combine 1 and 3, and if there came a point where data went stale and needed to be archived to the cloud, you could move it there. The same is true of 1 and 2.
The advantage of a volume management approach is its wide scope across the entire spectrum of the storage hierarchy. This means you can move data from one target to another without disruption, because the volume manager maintains the integrity of the move. The disadvantage is that every operation has to go through the volume manager, a server-based solution with more overhead. In addition, you'll have to pay a license for the volume manager, which is often priced on a per-TB basis. As well, volume managers tend to become deeply embedded and pervasive once installed (i.e. a big lock-in factor).
The advantage of an array-centric approach is that it extends the existing storage infrastructure further throughout the storage stack and leverages installed processes. As well, it's hardware-based, so it tends to be higher performance and can be real-time/dynamic. Said another way, if you're an EMC customer, with FAST VP you'll be able to keep doing what you're doing and get incremental, evolutionary value out of your storage investments. The disadvantage is that if you want to extend throughout the stack (i.e. if you want to move data out of the array and somewhere else), you have to manage the migrations yourself.
The advantage of the Tiered IO approach is that it provides an application view, and when used with a hypervisor you can easily guarantee quality of service (QoS). This is particularly appealing for cloud service providers (and enterprise IT managers wanting to compete with cloud providers) that want to charge for application-level performance. As well, changes can occur quickly and automatically; for example, if there's a performance problem, the system can adjust on its own. The disadvantage is that this approach can't be taken with spinning disk, because the pipe into each drive is too small; as a result it is really designed for applications that are tier 1 class. In a data center that is large enough to have dedicated arrays for tier 1 and other arrays for tiers 2 and 3, this is a good approach (i.e. for large data centers and service providers). It is less attractive for small businesses.
Action Item: Senior storage managers must take a hard look at what's most important for the business. If lowering the cost of tier 1 storage is the priority, then low-cost SATA and simple tiering mechanisms within the array are your friend. If response times are important and the business has external users that need very good and predictable performance levels, you should strongly consider tiering by IO access, especially if you can charge for it. Longer term, volume managers are complementary to these approaches because they can extend to the cloud, and should be considered a strategic investment.