The Bottleneck - Controller Software
Much of the June 29th Peer Incite discussion concerned practitioner views on the complexity building up in storage controllers as a result of the inclusion of a host of storage optimization features, including thin provisioning, compression, dedupe, snapshots, encryption and key management, and others. Presenters Richie Lary and Steve Sicola were asked whether this expansion in controller software intelligence contributes to drive failure rates (the topic of the call) that are higher than the historic averages from the time when controllers were responsible primarily for RAID and device I/O. The general consensus answer was “probably, because all this software makes the disk work harder”, but without a focused assessment it is difficult to know. It was proposed, however, that storage controllers and their functionality are becoming a bottleneck to application performance (software complexity could be one reason), and that software errors could be a growing cause of disk failure, after the adverse effects of environmental problems with vibration and cooling. Those environmental problems could also prove easier to control than software errors.
Moving Storage Functions to the Application
This led to a debate over whether these controller-based functions should be the responsibility of the application or embedded in the storage infrastructure. The answer depends first on what you define as an application. Is an application a business system like JD Edwards, an infrastructure application such as SameTime, a database application such as Open Office or CFO Central by Oracle, or a database management system or operating system? The answer also depends on the specific storage management function being considered. According to Lary and Sicola, storage systems should be designed to do only three things: be available to the application, protect data for the application, and move data when necessary. As the user and vendor community settles this debate over time, its impact on the overarching business imperative of reducing complexity in modern IT must be considered. The open, service-based standards and APIs being developed by the Cortex Developers Community and others to drive more interoperability and less complexity across a host of application and storage environments are part of that consideration.
What it Means
So what does this mean in terms of failure rates and storage reliability? There are three main points:
- First, controls on the environment reduce drive failure rates. Technologies such as the Intelligent Storage Element (ISE) from Xiotech are improving drive reliability by reducing failures due to vibration and cooling problems. ISE is helping users achieve much higher reliability rates than regular disk drives enclosed in a typical storage drive bay, reducing service events and their impact on IT organizations. Because of this reliability, Xiotech provides a five-year hardware warranty with all its ISE-based devices.
- Second, the impact of disk controller software complexity on disk failure is uncertain. According to Seagate Technology, almost three-quarters of drives returned as "failed" are found to be NTF, or no trouble found. Could the source of the failure be poor controller software quality, and poor error detection and correction programming (according to Lary and Sicola, only 5% of controller software is devoted to error handling and reporting)?
- And third, as the debate rages on over which storage optimization functions belong to the application layer and which to the infrastructure layer, more error management functions should be designed into the APIs, and the effect on disk failures of moving these functions should be measured and evaluated from a risk, cost, and business value perspective.
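The NTF figure in the second point can be turned into a rough back-of-the-envelope calculation. The sketch below is illustrative only: the 4% annual return rate is a hypothetical number, not one from the call, and "almost three-quarters" is rounded to 0.75. It estimates what share of an installed base has a confirmed drive fault once NTF returns are set aside.

```python
def confirmed_failure_rate(return_rate: float, ntf_fraction: float) -> float:
    """Estimate the share of drives with a confirmed fault.

    return_rate  -- fraction of installed drives returned as "failed" per year
    ntf_fraction -- fraction of those returns found to be NTF (no trouble found)
    """
    if not 0.0 <= return_rate <= 1.0 or not 0.0 <= ntf_fraction <= 1.0:
        raise ValueError("rates must be between 0 and 1")
    # Only the non-NTF share of returns reflects a fault found in the drive itself.
    return return_rate * (1.0 - ntf_fraction)


# Hypothetical inputs: a 4% annual return rate (assumed for illustration)
# and "almost three-quarters" NTF, rounded here to 0.75.
annual_return_rate = 0.04
ntf_fraction = 0.75

confirmed = confirmed_failure_rate(annual_return_rate, ntf_fraction)
print(f"Returned as failed: {annual_return_rate:.2%}")
print(f"Confirmed drive faults: {confirmed:.2%}")
```

Under these assumed inputs, most of the apparent failure rate would come from returns in which no drive fault was found, which is why the quality of controller error detection and reporting matters when comparing warranty and maintenance costs.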
Action Item: Look at the cost and value implications of storage system hardware warranties vs. maintenance plans. Push the vendor community to provide better warranties to reduce maintenance costs. Consider your own experience with disk failure rates, and the CAPEX and OPEX impact of a warranty vs. a maintenance plan. And, if you’re an application lead, get someone on your team involved in communities such as Cortex and VMware that are developing storage APIs. As storage functions move closer to the application stack, determine the impact on development and maintenance skill sets, testing, troubleshooting, and application interoperability.
Footnotes: June 29 Peer Incite: The Future of Storage - A discussion with technology gurus