Storage Peer Incite: Notes from Wikibon’s January 12, 2010 Research Meeting
Large SANs are complex, heterogeneous, and often highly performance-oriented. But SAN managers are handicapped by vendor management tools that present fragmented, incomplete information. Meanwhile, their opposite numbers in LAN and WAN management have been using increasingly unified monitoring and management tools from specialized vendors, designed to provide an overview of activity across the entire network.
The issue is one of organization and focus. When users complain about poor performance from applications on the LAN, IT looks both at the server the app runs on to see if it is underpowered and at the network itself and the nature of the traffic it is carrying to see if the problem lies there. Upgrading a server to fix a problem that actually is in the network is wasted money. When users complain about slow I/O times from the SAN, however, SAN managers tend to look only at the storage systems. Typically they have no tools to look at the traffic over the SAN network, nor does anyone think to do so. But a SAN is a network, just like a LAN or a WAN, and it requires network management tools to provide vital information to SAN managers.
This is the insight that guided SAN Manager Ryan Perkowski to new answers to his performance problems. Faced with a choice of investing heavily in a major SAN upgrade to fix several apparent network performance issues, he decided first to find out what was really going on on the network. So he invested in two network management tools.
The first thing he discovered was that one of the databases on his SAN was sending out query responses in 1 GB blocks. By reducing block sizes he was able to clear up one major network performance issue while providing better response times overall at virtually no cost.
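A rough way to see why oversized response blocks hurt a shared fabric is to model how long a single transfer monopolizes a link before other traffic can interleave. The sketch below is purely illustrative: the link speed is an assumption, not a figure from the meeting.

```python
# Illustrative only: how long one transfer holds a shared link, by block size.
# Assumes a ~4 Gb/s Fibre Channel link (~400 MB/s usable) -- an assumption,
# not a detail from Perkowski's environment.

LINK_MB_PER_SEC = 400  # assumed usable throughput

def burst_hold_time(block_mb: float) -> float:
    """Seconds a single block occupies the link before other I/O can interleave."""
    return block_mb / LINK_MB_PER_SEC

# A 1 GB response block vs. 1 MB blocks carrying the same payload:
one_big = burst_hold_time(1024)   # one long, uninterruptible burst
many_small = burst_hold_time(1)   # each small block yields the link quickly

print(f"1 GB block holds the link for {one_big:.2f} s per burst")
print(f"1 MB blocks hold it for {many_small * 1000:.1f} ms each")
```

The point of the arithmetic: the same payload arrives either way, but smaller blocks let other hosts' I/O slot in between bursts, which is why shrinking the block size relieved contention at essentially no cost.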
The second thing he discovered was a huge amount of user-generated network traffic in the 6 p.m. to midnight time slot, which the database vendor had assured him should be the lowest-volume period. This mystery traffic was preventing nightly backups from completing on time. He cleared up the mystery by talking to the users generating the traffic: they habitually sent in large batch requests just before leaving at 6 p.m., so that the data they needed the next day would be on their workstations in the morning. The answer was not to upgrade the SAN but to reschedule the backup for after midnight, when network traffic drops to a much lower level.
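Once a monitoring tool supplies hourly traffic figures, picking a backup window like the one above becomes a simple search for the quietest stretch of hours. A minimal sketch; the hourly numbers are invented for illustration, and in practice they would come from the monitoring tool's trending data.

```python
# Sketch: pick the quietest contiguous window for backups from hourly
# throughput samples. All numbers below are invented for illustration.

hourly_mb_sec = {  # hour of day -> average observed MB/s (sampled hours only)
    18: 310, 19: 295, 20: 280, 21: 260, 22: 240, 23: 220,
    0: 60, 1: 40, 2: 35, 3: 30, 4: 45, 5: 80,
}

def quietest_window(samples: dict, length: int) -> list:
    """Return the `length` consecutive sampled hours (wrapping past the last
    sample back to the first) with the lowest total traffic."""
    hours = sorted(samples)
    best, best_load = None, float("inf")
    for i in range(len(hours)):
        window = [hours[(i + j) % len(hours)] for j in range(length)]
        load = sum(samples[h] for h in window)
        if load < best_load:
            best, best_load = window, load
    return best

print(quietest_window(hourly_mb_sec, 4))  # the small hours after midnight
```

With data like this in hand, "move the backup to after midnight" stops being a guess and becomes a conclusion the SAN team can defend.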
The point, Perkowski says, is that SAN vendors are focused on developing storage technology. For them management tools are an afterthought, and they are only interested in supporting management of their components. Network management vendors, on the other hand, are focused on the issues of managing heterogeneous data networks as well as possible.
Any network manager will admit that these tools are not perfect. But no network manager today would try to run a large LAN without at least one (and often several). SAN managers need to take a long look at the independent management tools available and not just automatically increase network capacity every time they start getting complaints from users. What they discover may well save the organization several times the cost of these tools.

G. Berton Latamore
Today's SAN performance and management tools from array vendors fail to provide the requisite heterogeneity, metrics, and interoperability to enable SAN managers to effectively manage performance. And virtualization exacerbates their inadequacies.
Perkowski shared with nearly 60 Wikibon members in attendance how he and his team are able to dramatically improve processes around SAN performance management by gaining better visibility through the use of specific SAN tooling from Virtual Instruments and NetApp. It is Perkowski's contention that while these tools were initially difficult to justify, their payback has been substantially higher than expected.
Perkowski's firm is a Cisco shop, and he manages a SAN environment that is about 75% AIX systems with some Windows-based VMware hosts for test and development applications. His firm has grown storage capacity from 30TB to 450TB in three years. Driving that growth has been the acquisition of new credit card customers and new accounts as his firm has gained share in the U.S. market. The key applications in the shop are analytics and data warehouse systems based on a 20TB Oracle warehouse and other warehouses including a large SAS instance. Ryan manages a combination of EMC DMX-3's and DMX-4's as well as other NAS storage.
Whales Floating Through the SAN
According to Perkowski, these large warehouse applications are like ‘whales floating through the SAN.’ They caused frequent but intermittent performance bottlenecks that were difficult to pinpoint, complicated by the fact that the organization leverages the virtualization capabilities of AIX.
The firm was experiencing major SAN performance headaches and what Perkowski called ‘gray performance issues’, meaning that the root cause of a problem was difficult to find. These sporadic, unpredictable slowdowns led to a standard operating procedure: whenever a performance problem occurred, the SAN got blamed.
Like virtually all managers of high-performance environments, Ryan had to over-design the SAN to accommodate these fluctuations in performance. The challenge increasingly became tackling the endless ‘dark art’ of SAN management, which required an unproductive set of activities; Perkowski at one point described this as ‘grabbing and shaking the crazy black 8-ball’ to try to find answers. Clearly the organization was struggling with this problem, especially given its high rate of storage growth.
Visibility, Metrics and Trending
Metrics Perkowski was able to gather from his EMC array-based tools were limited to parameters such as cache hit rate, spindle response times, and other array-specific data. What he lacked was a fuller picture especially from the perspective of the end-user.
Perkowski initiated a proof-of-concept using NetWisdom from Virtual Instruments and SANscreen from NetApp which it acquired from Onaro. NetWisdom is a dedicated monitoring tool that uses a combination of software and hardware to probe the storage network and in particular the components that are problematic, in this case the Oracle data warehouse infrastructure. SANscreen is a heterogeneous service management suite which, among other things, describes the relationships between a particular application on a given server and its data on a storage device.
The combination of these two tools immediately gave Perkowski an avalanche of useful metrics about his SAN. Initially the volume of data was daunting, but within a short timeframe, by accessing trending data on metrics such as MB/sec, CRC errors, log-ins, and log-outs, he was able to either confirm or eliminate storage as the bottleneck in key applications. Perkowski eventually rolled these tools out as fundamental components of his infrastructure, starting with a single probe around the Oracle data warehouse and adding probes into other systems over time.
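Trending data earns its keep when it turns a vague "things feel slower" into a measurable direction of change. A minimal sketch of the idea, using a least-squares slope over a metric series; the CRC error counts and the alert threshold are invented for illustration, and a tool like NetWisdom would supply the real series.

```python
# Sketch: flag a worsening trend in a SAN metric (e.g. daily CRC error
# counts) using a least-squares slope. Sample data is invented.

def slope(values: list) -> float:
    """Least-squares slope of values against their index (change per sample)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

crc_errors_per_day = [2, 3, 2, 5, 7, 9, 14]   # invented sample series
if slope(crc_errors_per_day) > 1.0:            # threshold is arbitrary
    print("CRC errors trending up -- inspect optics/cabling on that path")
```

The same slope test applied to MB/sec or log-in/log-out counts is one simple way to "confirm or eliminate" storage as the bottleneck rather than arguing from a single snapshot.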
The results have been a dramatic improvement in problem determination and remediation and have given a credibility boost for the SAN team. Perkowski shared an example where the application developer had suggested the best backup window for a particular system was between 6PM and midnight prior to an automated batch job that kicked off overnight. However, the backup team was unable to complete the job within the prescribed window. Perkowski asked the backup team to refrain from performing the backup the next evening while he performed his analysis. He found that I/O activity on the SAN spiked from 6PM to midnight, the exact times the application developer had said activity would be lowest and best for the backup.
Perkowski went to the user organization and asked a few questions. As it turned out, the users were all queuing up batch jobs just before they left the building, hitting return at 6 p.m. and running their queries into the evening. It was one of the busiest times for the application on the SAN. Perkowski shared with us that he never would have gained the visibility to resolve this problem so quickly without the third-party tooling.
Justification for SAN Tools
The challenge Perkowski sees for practitioners is that the benefits of tools such as NetWisdom are hard to predict prior to installation. His organization can cite the following areas of improvement:
- Much faster and more accurate problem determination,
- Better capacity planning,
- More efficient provisioning,
- Cost savings through better IT productivity and more efficient use of SAN capacity,
- Substantially better application performance predictability and quality of service,
- Elimination of existing licenses and maintenance fees for array-based management software.
The challenge for practitioners is that they have no way of knowing the degree to which these tools will save money and improve service levels until they run a proof of concept. In the case of Ryan’s firm, the target applications are revenue generators (e.g., credit card transaction enablers). Users should understand that the costs of such tools roughly break down as follows:
- Software costs (~$50,000 per probe)
- Splitter costs (~$300/port),
- Server capacity to run the software,
- Additional disk capacity to house the tools and analysis,
- Time to install (a few days).
In the case of Ryan’s firm, the total costs roughly equate to $500,000 to accommodate about half of his 450TB of capacity (the high-performance half). This equates to roughly $2,000 per TB. The benefits easily outweigh the costs, according to Perkowski, but he had no way to know this going in. As such, a combination of existing pain, faith, and intelligent proofs of concept will reduce risk for SAN managers and result in potentially substantial benefits.
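The per-TB figure above follows directly from the totals in the text, and the same arithmetic is worth running for any environment before a proof of concept:

```python
# Worked arithmetic from the figures in the text: ~$500K of tooling
# covering the high-performance half of a 450 TB estate.

total_cost = 500_000        # dollars, from the text
covered_tb = 450 / 2        # the high-performance half, 225 TB

cost_per_tb = total_cost / covered_tb
print(f"${cost_per_tb:,.0f} per TB")   # comes out near the "roughly $2,000"
```

A shop with the same tooling bill spread over less capacity would see a correspondingly higher per-TB rate, which is why the calculation belongs in the justification, not just the invoice.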
Action item: Most array-based SAN management tools are deficient in their ability to detect and help remediate storage network performance problems. Organizations in high-performance, high-growth SAN environments should evaluate heterogeneous tools such as Virtual Instruments' NetWisdom, which can provide valuable metrics, trending data, and end-to-end visibility on performance bottlenecks. Tools like NetApp's SANscreen are complementary and can simplify change management and capacity planning. The ROI of these tools will be a function of the size of the SAN, its growth rate, and the value of applications to the business.
Because of the limited adoption of storage area network (SAN) monitoring and reporting tools, storage capacity and performance management has, for years, been a dark art. As a result, when storage administrators say to their CIOs that they "need more storage for the SAN," they could be proposing a solution to a problem that, at best, they do not fully understand, and, at worst, may waste money and fail to meet the performance needs of the application owners.
A storage area network (SAN) is more than the amount of data the SAN can hold. A SAN includes storage, switches, and adapters. In addition, each component comes with a host of configuration options, which may impact performance and available capacity. In order to ensure that data is delivered to applications when needed, storage administrators need the tools necessary to monitor, track, and forecast key metrics of storage capacity and SAN performance.
A storage area network (SAN) is a system. If a SAN were a transportation system, the storage might be the passenger bus. A bus holds a certain known capacity, has the potential to travel at a known maximum speed, can be boarded (filled up) and exited (emptied) at a known rate. The job of the bus is to get people and packages from location A to location B on time, without losing them or damaging them. And when the bus is full, it's full. If in a given amount of time you have more people to move than the bus can hold, you will need another bus. But the ability of the bus to perform the people-delivery service is also dependent upon the health of the bus, the quality of the roads, the speed limit, the amount of traffic on the road, and the optional routes, should the normal route become unexpectedly congested.
So how do transportation administrators know when they need more buses? Unfortunately, a picture of a full bus tells a transportation director very little about the ability or inability of the transportation system to deliver passengers on time and whether he needs more buses. The same holds true for SAN administration. A picture of a full storage system tells the CIO and the storage administrator very little about the ability of the SAN to meet the needs of the applications it supports.
Much has been made of the exponential growth of storage, but Ryan Perkowski found, through SAN monitoring and reporting tools, that his company's capacity requirements were actually growing on a linear basis, not exponentially. At the same time, I/Os per second (IOPS) were, in fact, growing exponentially. The information contained in his Virtual Instruments™ NetWisdom™ reports helped his company avoid over-provisioning of storage capacity, and enabled him to focus instead on providing the required storage performance and capacity. Just as importantly, the historical reports on performance trends enabled his company to identify impending performance bottlenecks and prevent them. The point-in-time capacity and performance snap-shots that are the limit of many storage and switch vendors' reporting capabilities are inadequate for intelligent SAN management.
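A simple heuristic separates the two growth patterns Perkowski observed: linear growth has roughly constant period-over-period differences, while exponential growth has roughly constant period-over-period ratios. The sketch below illustrates the test; the sample series are invented (though the capacity series echoes the 30TB-to-450TB growth mentioned earlier), and real figures would come from historical trend reports.

```python
# Sketch: distinguish linear from exponential growth in a metric series.
# Linear growth -> near-constant differences; exponential -> near-constant
# ratios. Sample data is invented for illustration.

def is_roughly_exponential(series: list, tol: float = 0.15) -> bool:
    """True if successive ratios are nearly constant (within tol of their mean)."""
    ratios = [b / a for a, b in zip(series, series[1:])]
    mean = sum(ratios) / len(ratios)
    return all(abs(r - mean) / mean < tol for r in ratios)

capacity_tb = [30, 170, 310, 450]   # steady ~140 TB/yr additions: linear
iops_k      = [40, 80, 165, 330]    # roughly doubling: exponential

print(is_roughly_exponential(capacity_tb))  # False
print(is_roughly_exponential(iops_k))       # True
```

Knowing which curve each metric follows is what let capacity purchases stay modest while the architecture conversation shifted to IOPS per TB.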
The adoption of storage tiers is expanding, and storage administrators are increasingly looking for Tier-0, solid state disk, to address their performance, not their capacity requirements. Investments in Tier-0 storage are substantially easier to understand and justify when viewed in the context of robust SAN reports of both capacity and performance. But if the real problem is in the network rather than the storage systems, then that investment may not remedy the problem.
Action item: CIOs should place a priority on investments in monitoring and reporting software for storage area networks, to enable better SAN management and predictive performance analysis. In times of limited budget, these investments should have a higher priority than investments in inadequately justified infrastructure. At the same time, CIOs should develop a richer shared vocabulary with their storage administrators to better describe, understand, and justify future investments in storage systems and storage network infrastructure. Virtual Instruments' NetWisdom is one of the reporting tools used by Ryan Perkowski. Virtual Instruments' tools are targeted at meeting the SAN monitoring and reporting needs for large, mission-critical SANs.
Performance SANs exist to provide optimal performance for mission-critical workloads and need to be managed as a whole. The switches and storage arrays that comprise the SAN each have data collection capabilities but do not provide an end-to-end view. The most logical and cost-effective place to tap into the key performance data of a SAN is just before the storage ports. Wikibon recommends that performance SANs be enabled to collect end-to-end performance data by putting in splitters between the storage array ports and the rest of the SAN. Splitters (or TAPs, Traffic Access Points) are passive devices that divert a percentage of the light from a fibre cable to performance data analyzers. The cost is approximately $300 per storage port and should be built into the purchase and operating procedures for all performance SANs.
The steps to fitting splitters are:
- Determine if multi-pathing has been correctly set up for each port, and ensure that any fail-over infrastructure that should have been there is actually there!
- Retrofit splitters/TAPs to all existing storage ports (non-disruptive if step 1 housecleaning has been done).
- Set in place purchase and installation procedures and training to ensure that splitters are installed as standard with every new array.
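Budgeting the retrofit described in the steps above is straightforward. The per-port price comes from the text; the port count and the spares margin are assumptions for the sake of the example.

```python
# Sketch: budgeting a TAP/splitter retrofit across storage ports.
# ~$300/port is from the text; port count and spares margin are assumed.

SPLITTER_COST_PER_PORT = 300  # dollars per storage port, per the text

def retrofit_cost(storage_ports: int, spare_ratio: float = 0.1) -> int:
    """Cost to tap every storage port, plus an assumed 10% spares margin."""
    units = storage_ports + int(storage_ports * spare_ratio)
    return units * SPLITTER_COST_PER_PORT

print(retrofit_cost(128))   # e.g. an array with 128 storage ports
```

Against the cost of the arrays those ports front, the instrumentation line item is small, which is the argument for making splitters standard with every new array rather than a later retrofit.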
In the 1/12/2010 Peer Incite, Ryan Perkowski talked about the importance of choosing best-of-breed tools for the SAN as a whole. For the end-to-end performance analyzer, he chose Virtual Instruments, which provides the NetWisdom probes that record and store all the data from the splitters and other SAN components, and enable this data to be analyzed and correlated. For overall SAN management, he chose the SANscreen product from NetApp, and for array management he chose the tools from the array vendor.
Action item: Senior IT management should identify the performance SAN infrastructure, ensure that effective end-to-end SAN performance tools are enabled by putting in fibre splitters as standard, and use best-of-breed end-to-end SAN performance tools to manage SAN performance.
Ryan Perkowski has 450 terabytes installed. Half of it (225TB) is performance- and availability-critical. At today’s storage prices for Tier 1 storage, that is a current storage value of less than $2 million on the floor. So why did Ryan pay $450K for Virtual Instruments tools to monitor the SAN?
The method that Ryan uses to convince his customers is to properly cost out the all-inclusive cost per TB that the end-user pays. The cost to his business customers for mission-critical tier-1 storage is $60,000/TB, ten times the purchase price. It includes the costs of backup and recovery, performance and availability assurance, additional copies, the storage network, compliance, storage staff, and the monitoring tools. Sure, if the project will work with tier-2 storage without the frills, go for it. But the cost of supplementing the services if they are actually required will be much higher for the project team than taking standard storage services. If a performance SAN is required, the cost of the monitoring software as a proportion of the total cost is small.
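The ten-times-purchase-price relationship can be made concrete with a simple chargeback model. The component line items below are invented placeholders; only the structure (raw purchase as a small fraction of the loaded rate, with the categories the text lists) reflects Ryan's method.

```python
# Sketch: an all-inclusive chargeback rate per TB, in the spirit of the
# $60K/TB tier-1 figure above. Component values are invented placeholders;
# only the category list and the ~10x relationship come from the text.

raw_purchase_per_tb = 6_000   # dollars/TB, assumed ~1/10 of the loaded rate

loaded_components = {          # dollars/TB, all invented for illustration
    "backup_and_recovery":       12_000,
    "additional_copies":         14_000,
    "storage_network":            8_000,
    "perf_and_avail_assurance":   7_000,
    "compliance":                 5_000,
    "staff_and_monitoring_tools": 8_000,
}

loaded_per_tb = raw_purchase_per_tb + sum(loaded_components.values())
print(f"${loaded_per_tb:,} per TB")   # ten times the raw purchase price
```

Seen against a loaded rate like this, the monitoring tools are one small entry in the "staff and monitoring" line rather than a standalone expense to argue over.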
What are the benefits of performance knowledge? There are four levels of justification:
- Cost Avoidance: Storage that was going to be bought to solve a perceived I/O problem that was not actually a storage problem is not bought. In Ryan’s case, he avoided having to upgrade an EMC DMX-3 to a DMX-4, because the Virtual Instruments probe provided detailed, correlated historical information showing that reducing the block size at the database level would solve the performance problem. This alone more than paid for the whole installation. By knowing the end-to-end performance characteristics across the SAN, he avoided the cost of storage over-provisioned “just in case”.
- Time to Solution: Knowing that the problem is not in the SAN, and having a wealth of information to help identify server-side problems, means projects can be rolled out more quickly. The tool is being used by the database groups to help them implement better solutions and solve problems faster. The price of additional probes is included as part of the development cost.
- Rationalization of Storage Software on the SAN Components, particularly at the switch and array level: Each component should have the tools to manage operations, but end-to-end management should be left to best-of-breed heterogeneous tools that take all the data from all SAN components and correlate it historically. Ryan is in the process of removing some of the storage management software and saving a bundle.
- A Deeper Understanding of Storage Performance Trends: In Ryan’s case it shows that data growth has gone from exponential to linear, but that access density is going through the roof. The data is being exploited much more heavily. Ryan is in a position to know that he has to think about performance storage architectures that are capable of delivering much higher levels of IOPS/TB than the current architecture can provide. He has the data and charts to show it, and the confidence of senior IT management that there is value in exploiting the data and that there is a justified price to pay for improved storage IOPS performance.
So are SAN end-to-end performance tools the solution to all performance problems? No; they have profound limitations. They provide a snapshot of the SCSI conversations between HBA and disk. If a component of the SAN does not provide detailed data, those data correlations will be missing. The SAN tools do not provide an application view of performance or show whether the storage system is meeting the SLAs for that application. And in a virtual server environment, they do not show the relation between the I/Os and the virtual machines from whence they came.
This data is required in the open systems arena. But developing it requires a new management model and new standards. Companies such as EMC are attempting to introduce these models and tools in VMware, and the very recent partnership between HP and Microsoft claims to be aiming to solve the same problem.
Currently it requires an army of experts to solve a deep performance problem in a virtual machine environment. The creation of more proprietary stacks should eventually provide better end-to-end tools and reduce the size of the army. Eventually the tools may be automated and eliminate the army altogether.
Action item: While we are waiting for nirvana, IT storage managers and senior IT management with high-performance SANs would do well to follow Ryan’s philosophy. Focus on a very few best of breed third party tools for end-to-end SAN management, and use vendor tools for within component management. And get rid of everything else.
Many organizations recognize their SAN as a necessary evil. It is a huge performance boost to I/O intensive applications, but it is also considered a ‘dark art’ because of a lack of knowledge in how to troubleshoot or optimize it. As a result, SANs across the globe have been over-provisioned, or over-architected to ensure predictable performance. This is music to the ears of any storage vendor sales person, but a painful pill to swallow as a customer, and the customers are getting wise to this.
OEM storage tools, on the whole, are very good at managing the storage but are horribly lacking when it comes to performance and troubleshooting metrics. If vendors really want to prove they are the best game in town, they should show customers by letting them form their own conclusions based on hard data. Give customers an in-depth and intuitive performance tool. Give them the ability to trend performance data over time, and show them where performance tweaks can be made.
As a customer, being told that you do not have to buy more HBAs to gain bandwidth or short-stroke a RAID group to gain IOPS is an immediate save. Getting more out of your storage investments allows you to justify buying more of that type of storage. It may sound counter-intuitive at first, but consider the following: You buy a new car for its ability to haul ‘stuff’. You did your research and learned that this car holds 64 cubic feet of ‘stuff’. Now, what if that car dealer taught a class on ‘how to pack your car properly’, allowing you to get a little more out of the space you already have, rather than trying to sell you a second car to haul more stuff? What if, instead of a ‘hauls a lot of stuff’ car, you bought a ‘goes fast’ car? Imagine being given instruction on how to take corners at speed, how to heel-toe shift, and so on. Even if your car was not the fastest, you would know how to get the most out of it, and you would probably become a repeat customer at that dealership.
If storage providers would spend a little more time presenting performance and utilization metrics in an easy-to-use and intuitive format, it would empower their customers to get the most from their product. No longer would the customer feel like they were throwing money into their SAN money pit. When the time came to actually buy more hardware, the customer would know they were making the right decision and, more important, that the purchase would solve their needs the first time.
A ‘pie in the sky’ solution would be for storage vendors to capture useful metrics not only from their storage frames but also from the switches they sell to interconnect the SAN, all into a single performance tool. At the very least, storage vendors should work to standardize a common set of performance and troubleshooting metrics so other products could pool them together into a meaningful performance tool.
Action item: Vendors should consider showing their customers how to optimize storage versus keeping SAN a dark art as a means to sell more storage. While a single deal situation might lead to less storage being sold, doing so will lead to higher customer loyalty and a longer relationship with that customer.
In the world of high-performance highly available applications, the end-user experience and meeting SLAs are of paramount importance to the IT team responsible for supporting high transaction volume systems that are critical to maintaining and driving business. Meanwhile, the explosive growth of data is straining information infrastructures and putting added pressure on IT to maintain or even improve performance levels while attempting to avoid increased storage and switching costs wherever possible as long as service levels are not compromised.
This scenario was well portrayed by Ryan Perkowski, a 10-plus-year financial services SAN-management veteran who shared his experiences, expertise, and opinions with the Wikibon community during a January 12th Peer Incite research meeting. Ryan walked participants through his present storage environment, which has grown from 30TB to over 450TB in the last few years. The most mission-critical applications include a SAS analytics instance and a customer data warehouse totaling roughly 20TB running on an Oracle database. Because his environment is overwhelmingly populated with Cisco MDS 9000 multilayer directors and fabric switches as well as EMC DMX-3 and DMX-4 storage arrays, he initially utilized the EMC Control Center (ECC) platform to minimize the number of applications needed to help manage the SAN.
Problem with SAN Management Platforms
While Perkowski still needed the core array tools provided by EMC to do mapping, masking, and zoning, the goal of acquiring a product to bring together all the necessary reporting, monitoring, and SAN management capabilities into a “single pane of glass” in his complex environment ultimately proved to be an illusion. In particular, ECC lacks the ability to provide IT with an end-user-centric view.
Perkowski turned to Virtual Instruments and its NetWisdom family of SAN I/O performance monitoring and SAN troubleshooting products, and also to SANscreen from NetApp, which it acquired from Onaro: the same pairing of dedicated I/O monitoring and heterogeneous service management described earlier.
By bringing in two best-of-breed tools, Virtual Instruments' NetWisdom and NetApp's SANscreen, large shops like Perkowski's can get critical SAN performance data, obviating the need for ECC and other aggregation tools with diluted functionality. In this instance, the goal of a single pane of glass ultimately produced a watered-down version of everything.
Action item: Large, complex SAN environments should get rid of unneeded management software that doesn’t interoperate well and doesn’t solve the problem. Get rid of bloated, poorly integrated management software and replace it with best of breed point solutions that will drive higher ROI.