An Agony in Eight Fits1
Fit the First: Introduction
Wikibon has been following the evolution and adoption of flash for a number of years. We were delighted to talk to Mike Prepelica, Director of IT for Revere, about his decision to invest in flash-based storage on the server instead of SAN-based storage for his core IT systems.
Over the last year Revere Electric, a century-old, family-owned, Chicago-based electrical supply company with about 200 employees, grew its business 20% without adding any headcount whatsoever. This means that overall productivity grew by 20%. Usually business growth mandates an increase in headcount. There could be a number of reasons for the productivity gains:
- Slack in the system and spare capacity in the workforce;
- Employees working additional hours without pay because of the recession;
- Improvements in business processes;
- Lower prices that increased inventory turn but reduced gross margins;
- Improvements made to the IT systems;
- Etc.
Bottom Line
Revere believes that the biggest factor in the productivity improvement was the improvement in the performance of its IT systems as a result of installing Fusion-io flash storage on the servers. The performance improvement allowed Revere to exploit its Epicor Eclypse ERP system more effectively, and allowed the users to be much more productive. This productivity improvement is estimated to have generated over one million dollars ($1 million) in business savings for Revere over three years that will go directly to the bottom line.
Fit the Second: Revere IT Infrastructure
Revere is an above-average user of IT systems. Of the 200 employees, 125 are users of its IT system. Revere runs its business on an Epicor Eclypse™ ERP system, a specialist system designed for distribution companies. All day-to-day operations, including finance, are supported, and the same system is also used extensively for planning.
This ERP package is highly integrated and flexible, based on a single interconnected database. However, the very characteristics that make Eclypse so valuable to Revere put very heavy demands on the data storage system. To deal with the heavy I/O, Revere had an old 42-disk Storage Area Network (SAN), with the highest-speed 15K SAS disks available at the time. The pressure of growth and increasing sophistication of usage of the ERP system meant that response times of many of the functions were going to hell in a handbasket; Revere was planning an expensive SAN upgrade to try to restore adequate service to its customers.
Rather than update the SAN, Revere's Director of IT Mike Prepelica chose to install two 320-GB flash ioDrives from Fusion-io. The initial reason for the choice was that it was less expensive than the maintenance of the current SAN. The second reason was a strong recommendation for Fusion-io by the ISV Epicor, which knows that the true potential of its Eclypse system is constrained by the slow speed of storage. Storage is mechanical, and mechanical disk speeds have not improved. You can store much more data on a disk, because the disk heads have followed Moore’s Law. But the speed of access has not changed at all. The following is a quote from a previous Wikibon article:
- Spinning disk sucks. Sorry, but it's the truth. Rotating storage is painfully slow. The system is waiting for spinning, mechanical rust. This arcane approach, which has been in place for more than five decades, is forced on systems architects, application developers and ISVs. Access to mechanical storage is so, so slow – measured in milliseconds. Even the use of flash storage technologies that use the ancient channel and storage protocol paradigm are still very slow compared with processor and RAM speeds.
- From a systems perspective, the only thing good you can say about storage is it’s the only technology that is persistent except for very expensive RAM storage protected by batteries. This fact has allowed the storage industry to extract rents from users for decades.
Epicor as an ISV is ahead of the curve in understanding the benefits created by making the application as flexible as possible and eliminating the I/O waits by getting rid of the SAN and disks and placing the whole application in flash memory next to the server. The dramatic speed-up of system I/O enabled much faster system response times, and this in turn enabled the user response times to be improved. Epicor understands that flash storage close to the server would have a profound impact on the performance of all the systems, and improve the productivity of Revere's internal users, as well as the external customers who have access to Revere's IT systems.
Fit the Third: Revere became a RARC
Revere is now a Rapid Response-time Company (RARC), which delights its customers, suppliers and partners, and lets it turn on a dime. “As soon as I installed the ioDrives, the performance just blew us away,” Prepelica said. Here are some proof points:
- Processing daily purchase orders took 20-30 minutes on the SAN. On the ioDrives, it took less than one minute.
- This allows the purchase department to complete the processing of all new orders and submit any corrections all in the same day.
- Rebuilding the company's business intelligence cube took three to four days on the SAN, in part because IT could only run the rebuild during off-hours. On the Fusion-io system it took just four hours, allowing overnight rebuilds. By coincidence, at the time the solid-state system moved from the test environment to production, modifications needed to the cube required three rebuilds. That took just three overnights rather than nearly three weeks.
- This allows planners to start planning three weeks earlier, and be able to revise cubes and get the data the day after revising cubes as planning progresses. This is a major improvement in the quality of the planning process, the productivity of the planning department and the ability of the company to react quickly to changing market conditions.
- The response times at the distributed call center improved significantly, and the call rate has increased by 19.7% since 2010 with no increase in headcount.
- Overall, the solid-state system provided much better response time on all tasks for the 125 users of the systems.
- The impact of improved and more consistent response times is to improve the productivity of end-users very significantly, and improve the agility and productivity of the company.
Fit the Fourth: The Economic Impact of Rapid Response Time
Most of the original work done on the impact of system response time on productivity was done 20-30 years ago, but is just as valid today. The Economic Value of Rapid Response2 is a good summary of the findings.
- There are two components of response time, system response time and user response time;
- Halving system response time reduces overall response time by ~20%;
- Productivity continues to improve as system response time becomes sub-second, especially for expert users;
- Quality of user work (mistakes avoided) improves significantly as response time improves;
- Systems become more efficient as system and user response times improve (reduced task switching, fewer threads, fewer I/Os, less memory and CPU usage) – this helps to make response time more consistent;
- Consistent response time is very important to achieve the productivity gains.
Figure 1 gives a vivid illustration of the impact of improved system response time on overall response time, and the impact on productivity. Reducing the system response time from 3 seconds to 0.3 seconds reduced overall response time from 20 seconds to 10 seconds, and the productivity of system users increased by 106%.
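As a rough sketch of the arithmetic behind those numbers, the snippet below splits overall response time into its system and user components and estimates the productivity gain as the change in interactions completed per hour. The function and variable names are illustrative, and treating productivity as interactions per hour is an assumption, not something the original study states. With the rounded figures quoted above, the sketch gives a gain of about 100%; the 106% in the text presumably comes from the unrounded data behind Figure 1.

```python
# Rounded figures quoted in the text, in seconds per interaction.
BEFORE = {"system": 3.0, "overall": 20.0}
AFTER = {"system": 0.3, "overall": 10.0}


def user_time(sample):
    """User response time: overall response time minus system response time."""
    return sample["overall"] - sample["system"]


def interactions_per_hour(sample):
    """Assumed productivity measure: completed interactions per hour."""
    return 3600 / sample["overall"]


def productivity_gain_pct(before, after):
    """Percentage increase in interactions per hour from before to after."""
    return (interactions_per_hour(after) / interactions_per_hour(before) - 1) * 100


system_saved = BEFORE["system"] - AFTER["system"]
user_saved = user_time(BEFORE) - user_time(AFTER)

print(f"System time saved per interaction: {system_saved:.1f} s")
print(f"User time saved per interaction:   {user_saved:.1f} s")
print(f"Estimated productivity gain:       {productivity_gain_pct(BEFORE, AFTER):.0f}%")
```

The point of the split is that most of the 10 seconds saved comes from the user component rather than the system component: faster system responses let users keep their train of thought and respond faster themselves.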
Revere’s system response times were reduced by at least a factor of 10 as a result of implementing the Fusion-io flash storage. The use of other flash approaches would have improved response time but not nearly as much as just putting the whole system next to the processors.
Fit the Fifth: Deep Dive - The Technical Reason for Improved Response Times
This section dives into the technical reasons for the 10-fold improvements in system response time; those who want to get to the bottom line can skip it!
The technical reasons for the 10-fold improvements in system response time from the ioDrive:
Fit the Sixth: The Real Business Impact of Flash Storage
The ioDrives in the server improved the I/O system from highly variable access times of over 20 milliseconds to a consistent response time of 1 ms. This resulted in an improvement in user response time, and an improvement in general productivity of the users. The resulting productivity increase was a major contributor to enabling Revere to absorb the 20% of extra business with no increase in headcount. Without any productivity gain, the company would have had to hire about 40 people to take care of the growth. Some of the factors that contributed to improvements were detailed in the introduction, but the biggest single factor was the improvement in system response times, and the resultant improvements in user productivity.
Making the assumption that 25% of the overall benefit came from system response time (probably a conservative assumption), the flash storage avoided the need to hire 10 people. Again assuming a conservative average of $36,000 per year for fully-loaded staff costs in the distribution industry, the overall saving for the company was $360,000 every year. In a three-year business case that would mean an addition of over $1 million in business savings direct to the bottom line.
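The savings estimate can be reproduced in a few lines. This is a minimal sketch using only the figures given in the text (200 employees, 20% growth, a 25% share attributed to flash, $36,000 per head, three years); the function and variable names are illustrative.

```python
def flash_savings(employees, growth_pct, flash_share_pct, cost_per_head, years):
    """Return (avoided hires, hires attributed to flash, annual saving, total saving)."""
    # Hires that would have been needed to absorb the growth with no productivity gain.
    avoided_hires = employees * growth_pct / 100
    # Portion of those avoided hires credited to faster system response time.
    flash_hires = avoided_hires * flash_share_pct / 100
    annual_saving = flash_hires * cost_per_head
    return avoided_hires, flash_hires, annual_saving, annual_saving * years


avoided, flash_hires, annual, total = flash_savings(
    employees=200,
    growth_pct=20,
    flash_share_pct=25,
    cost_per_head=36_000,
    years=3,
)

print(f"Hires avoided overall:        {avoided:.0f}")
print(f"Hires attributed to flash:    {flash_hires:.0f}")
print(f"Annual saving:                ${annual:,.0f}")
print(f"Saving over the business case: ${total:,.0f}")
```

Changing `flash_share_pct` shows how sensitive the result is to the 25% attribution, which is the least certain input in the estimate.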
This came as an unexpected bonus. Revere cost-justified the move to solid state based on the savings it would realize over replacing its existing SAN, which was running out of power. However, the greatest benefit to Revere is that it has become a RARC.
Fit the Seventh: Summary
The Fusion-io system has also given Revere Electric the benefit, in its highly competitive marketplace, of increased responsiveness to user needs and market changes. That improved response time means employees can answer customer questions, handle their orders, and respond to their needs faster. The result is that customers get better service. And it means that managers can get the reports they need to make business decisions faster, at RARC speed.
Fit the Eighth: Recommendations
Revere's original ROI study for installing the Fusion-io drives was based on an expectation that performance would improve to the old level and the cost would be about the same. In retrospect, Mike believes that he should have led with the business benefits of improved response time for the end-users and for the business as a whole. That would have ensured that the fastest I/O solution would have been selected, even if it cost the business more.
Integrated database packages or home-grown systems are an excellent starting point for implementing flash storage on the server to create rapid response time systems; the faster, the better. A ten-fold improvement in system response time should lead to a 50% improvement in end-user productivity.
Data held on slow spinning rust is not getting any faster and should be used for archiving unstructured files. If high-performance 15K drives and/or SSDs are needed to meet performance requirements, flash storage in the server will almost always produce a lower cost result with better response times.
ISVs should start designing their systems for flash memory next to the server and start to use architectures such as the Fusion-io VSL memory subsystem architecture to further reduce response times by eliminating the I/O stack for 90% of the I/Os and replacing it with direct calls to the flash controller. These systems will help the organizations that use them to become RARCs.
CEOs and CIOs should pursue RARC aggressively with forks and hope, and set goals to become a Rapid Response-time Company. CIOs should be aggressively looking for ways to provide the fastest possible response times by exploiting the performance and persistence of flash memory as close as possible to the server. In architecting the systems of the future, Facebook and Apple are designing the core systems that will deal with the 90% or more of the I/Os and 10% of the data with flash memory very close to the processor. The remaining 90% of the data and 10% of the I/Os will be on low-performance, low-cost SATA spinning disk. This architectural approach is referenced in a Wikibon posting “Littles Law and Lean Computing”, and the potential design of low cost geographically dispersed archiving systems is discussed further in a posting “Reducing the Cost of Secure Cloud Archive Storage by an Order of Magnitude”.
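To see why the 90/10 split matters, the sketch below computes the average access time across the two tiers as a weighted average. The latencies are assumptions borrowed from Revere's own figures in Fit the Sixth (about 1 ms for server flash, about 20 ms for the disk tier); low-cost SATA disk could well be slower, so treat the result as indicative only. The names are illustrative.

```python
def average_latency_ms(tiers):
    """Weighted average access time over (share_of_ios, latency_ms) pairs."""
    total_share = sum(share for share, _ in tiers)
    # Guard against a split that does not account for all I/Os.
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("I/O shares must sum to 1.0")
    return sum(share * latency for share, latency in tiers)


FLASH_MS = 1.0   # assumed: consistent flash response time quoted for Revere
DISK_MS = 20.0   # assumed: lower bound of the disk access times quoted for Revere

tiered = average_latency_ms([(0.9, FLASH_MS), (0.1, DISK_MS)])
all_disk = average_latency_ms([(1.0, DISK_MS)])

print(f"Average access time, 90/10 tiered: {tiered:.1f} ms")
print(f"Average access time, all disk:     {all_disk:.1f} ms")
print(f"Speed-up from tiering:             {all_disk / tiered:.1f}x")
```

Under these assumptions, the 10% of I/Os that still go to disk account for most of the average access time, which is why the approach depends on keeping the active 10% of the data in flash.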
Footnotes
1. Homage to Lewis Carroll (The Hunting of the Snark)