Opening Pandora's Box of SAN Management: A Practitioner's View

Today's SAN performance and management tools from array vendors fail to provide the requisite heterogeneity, metrics, and interoperability to enable SAN managers to effectively manage performance. And virtualization exacerbates their inadequacies.

This was the assertion of Ryan Perkowski, a SAN practitioner at a major U.S. credit card firm, who addressed the Wikibon community at the January 12th Peer Incite Research Meeting.

Perkowski shared with nearly 60 Wikibon members in attendance how he and his team were able to dramatically improve processes around SAN performance management by gaining better visibility through specific SAN tooling from Virtual Instruments and NetApp. It is Perkowski's contention that while these tools were initially difficult to justify, their payback has been substantially higher than expected.

Perkowski's firm is a Cisco shop, and he manages a SAN environment that is about 75% AIX systems with some Windows-based VMware hosts for test and development applications. His firm has grown storage capacity from 30TB to 450TB in three years. Driving that growth has been the acquisition of new credit card customers and new accounts as his firm has gained share in the U.S. market. The key applications in the shop are analytics and data warehouse systems based on a 20TB Oracle warehouse and other warehouses, including a large SAS instance. Ryan manages a combination of EMC DMX-3s and DMX-4s as well as NAS storage.

Whales Floating Through the SAN

According to Perkowski, these large warehouse applications are like ‘whales floating through the SAN.’ They caused frequent and intermittent performance bottlenecks that were difficult to pinpoint, a problem complicated by the fact that the organization leverages the virtualization capabilities of AIX.

The firm was experiencing major SAN performance headaches and what Perkowski referred to as ‘gray performance issues’, meaning that the root cause of the problem was difficult to find. These sporadic and unpredictable slowdowns led to a de facto standard operating procedure: when a performance problem occurred, the SAN got blamed.

Like virtually all managers of high-performance environments, Ryan had to over-design the SAN to accommodate these fluctuations in performance. Increasingly, the challenge became the endless ‘dark art’ of SAN management, which demanded an unproductive set of activities. Perkowski at one point described this as ‘grabbing and shaking the crazy black 8-ball’ to try to find answers. Clearly the organization was struggling with this problem, especially given its high rate of storage growth.

Visibility, Metrics and Trending

The metrics Perkowski was able to gather from his EMC array-based tools were limited to parameters such as cache hit rate, spindle response times, and other array-specific data. What he lacked was a fuller picture, especially from the perspective of the end-user.

Perkowski initiated a proof-of-concept using NetWisdom from Virtual Instruments and SANscreen from NetApp, which NetApp acquired with Onaro. NetWisdom is a dedicated monitoring tool that uses a combination of software and hardware to probe the storage network, in particular its problematic components, in this case the Oracle data warehouse infrastructure. SANscreen is a heterogeneous service management suite that, among other things, maps the relationships between a particular application on a given server and its data on a storage device.
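
To make the relationship mapping concrete, below is a minimal sketch in Python of the kind of end-to-end path record such a suite maintains. The data model, field names, and example values are illustrative assumptions, not SANscreen's actual schema.

    from dataclasses import dataclass

    @dataclass
    class StoragePath:
        application: str   # e.g., the Oracle data warehouse
        host: str          # server the application runs on
        hba_port: str      # host bus adapter WWPN
        switch_port: str   # fabric port the HBA logs in to
        array_port: str    # front-end port on the array
        lun: str           # logical unit holding the data

    paths = [
        StoragePath("oracle-dw", "aix-prod-01", "10:00:00:00:c9:aa:bb:01",
                    "fc1/7", "FA-7A:0", "LUN-0142"),
    ]

    # With the mapping in hand, "which array ports does this application
    # touch?" becomes a query rather than a manual trace.
    print({p.array_port for p in paths if p.application == "oracle-dw"})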

The combination of these two tools immediately delivered an avalanche of useful metrics about Perkowski's SAN. Initially he was overwhelmed, but within a short timeframe, by accessing trending data on metrics such as MB/sec, CRC errors, logins, and logouts, he was able to either confirm or eliminate storage as the bottleneck in key applications. Perkowski eventually rolled these tools out as fundamental components of his infrastructure, starting with a single probe around the Oracle data warehouse and adding probes into other systems over time.
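
As an illustration of that confirm-or-eliminate workflow, the following sketch scans hourly probe samples for CRC errors and link saturation. It assumes the trending data has been exported to a CSV file; the file name, column names, link capacity, and threshold are all hypothetical.

    import csv
    from collections import defaultdict

    LINK_CAPACITY_MBS = 400   # assumed 4Gb Fibre Channel link, ~400 MB/sec
    SATURATION = 0.80         # flag links running above 80% of capacity

    # Group the hourly samples by switch port.
    by_port = defaultdict(list)
    with open("probe_export.csv") as f:   # hypothetical probe export
        for row in csv.DictReader(f):
            by_port[row["port"]].append(row)

    for port, samples in by_port.items():
        peak_mbs = max(float(s["mb_per_sec"]) for s in samples)
        crc_errors = sum(int(s["crc_errors"]) for s in samples)
        if crc_errors:
            print(f"{port}: {crc_errors} CRC errors, check optics and cabling")
        elif peak_mbs > SATURATION * LINK_CAPACITY_MBS:
            print(f"{port}: peaks at {peak_mbs:.0f} MB/sec, likely the bottleneck")
        else:
            print(f"{port}: within limits, storage can be ruled out here")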

The results have been a dramatic improvement in problem determination and remediation and a credibility boost for the SAN team. Perkowski shared an example in which an application developer had suggested that the best backup window for a particular system was between 6PM and midnight, prior to an automated batch job that kicked off overnight. However, the backup team was unable to complete the job within the prescribed window. Perkowski asked the backup team to refrain from performing the backup the next evening while he performed his analysis. He found that I/O activity on the SAN spiked from 6PM to midnight, the exact times the application developer had said activity would be lowest and best for the backup.

Perkowski went to the user organization and asked a few questions. As it turned out, the users were all queuing up batch jobs just before they left the building, hitting return at 6PM and running their queries into the evening, making it one of the busiest times for the application on the SAN. Perkowski shared with us that he never would have gained the visibility to resolve this problem in such a fast timeframe without the third-party tooling.
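
The backup-window anecdote boils down to a simple analysis: aggregate I/O by hour and find the quietest stretch, rather than trusting assumptions about when users are busy. The numbers below are invented to mirror the story, not Perkowski's actual data.

    # Hourly (hour, MB/sec) samples: heavy user-driven load from 6PM to
    # midnight, quiet overnight.
    samples = [(17, 220), (18, 610), (19, 580), (20, 540), (21, 490),
               (22, 430), (23, 380), (0, 90), (1, 60), (2, 55), (3, 50)]

    WINDOW = 4  # hours the backup needs to complete

    # Slide a 4-hour window across the samples and total the load in each.
    loads = [(sum(mb for _, mb in samples[i:i + WINDOW]), samples[i][0])
             for i in range(len(samples) - WINDOW + 1)]
    total, start = min(loads)
    print(f"Quietest {WINDOW}-hour window starts at {start}:00 (load {total})")
    # With these numbers the quietest window starts overnight, not in the
    # 6PM-to-midnight slot the application developer suggested.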

Justification for SAN Tools

The challenge Perkowski sees for practitioners is that the benefits of tools such as NetWisdom are hard to predict prior to installation. His organization can cite the following areas of improvement:

  • Much faster and more accurate problem determination,
  • Better capacity planning,
  • More efficient provisioning,
  • Cost savings through better IT productivity and more efficient use of SAN capacity,
  • Substantially better application performance predictability and quality of service,
  • Elimination of existing licenses and maintenance fees for array-based management software.

The challenge for practitioners is that they have no way of knowing the degree to which these tools will save money and improve service levels until they run a proof of concept. In the case of Ryan’s firm, the target applications are revenue generators (e.g., credit card transaction enablers). Users need to understand that the cost components of such tools roughly break down as follows:

  • Software costs (~$50,000 per probe),
  • Splitter costs (~$300/port),
  • Server capacity to run the software,
  • Additional disk capacity to house the tools and analysis,
  • Time to install (a few days).

In the case of Ryan’s firm the total costs roughly equate to $500,000 to cover about half of his 450TB of capacity, the high-performance half. This equates to roughly $2,000 per TB. According to Perkowski the benefits easily outweigh the costs, but he had no way to know this going in. As such, a combination of existing pain, faith, and intelligent proofs of concept will reduce risk for SAN managers and can deliver potentially substantial benefits.
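
For readers who want to see how those line items might compose, here is a back-of-envelope version of the arithmetic. Only the per-probe and per-port prices come from the figures above; every count is a hypothetical chosen to land near the reported totals.

    probes = 8                 # hypothetical probe count
    probe_cost = 50_000        # ~$50,000 of software per probe
    tapped_ports = 200         # hypothetical number of splitter ports
    port_cost = 300            # ~$300 per port
    server_and_disk = 40_000   # hypothetical server and disk capacity

    total = probes * probe_cost + tapped_ports * port_cost + server_and_disk
    covered_tb = 450 / 2       # the high-performance half of 450TB
    print(f"Total ~${total:,}; ~${total / covered_tb:,.0f} per TB")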


Action Item: Most array-based SAN management tools are deficient in their ability to detect and help remediate storage network performance problems. Organizations in high-performance, high-growth SAN environments should evaluate heterogeneous tools such as Virtual Instruments' NetWisdom, which can provide valuable metrics, trending data, and end-to-end visibility into performance bottlenecks. Tools like NetApp's SANscreen are complementary and can simplify change management and capacity planning. The ROI of these tools will be a function of the size of the SAN, its growth rate, and the value of applications to the business.
