Storage Peer Incite: Notes from Wikibon’s April 6, 2010 Research Meeting
In this economy, with super-tight budgets, CIOs can use a little good news. So here it is: A little creativity and common sense can save both CapEx and OpEx in the data center. That is the message that Eugean Hacopians, senior systems engineer at Caltech and now also principal at ANRE Technologies, brought to the Wikibon community in the latest Peer Incite Meeting. And Hacopians knows -- he has created a consulting business out of his experience.
One mistake that data center managers often make, he says, is to look at the problem in pieces. A data center is an environment that must be seen holistically.
For instance, the arrangement of equipment on the floor can make a huge difference in overall efficiency. IT equipment actually takes up only part of the space and power in the data center. The rest goes to support equipment such as power distribution and cooling. He suggests that the data center be physically reorganized to create a "cool aisle" containing the actual IT equipment and a "hot aisle" containing everything else. Virtualization and upgrade to higher capacity blades can allow further compression of the physical space the IT side needs.
At Caltech this approach segregated all the actual compute and storage into half of the building. By building a non-load-bearing wall down the middle of the room and ducting the cooling to where it was needed while allowing temperatures in the "hot aisle" to rise to 78 degrees, he was able to cut cooling requirements significantly while increasing processing capacity to meet growing demand. The trick, he said, was to extend the physical separation between the two aisles below the raised floor to control air flow.
The articles below provide details on other common sense approaches to consider to extend the life of the data center while decreasing CapEx and OpEx. G. Berton Latamore
Data center modernization projects are onerous and expensive. Common sense approaches to airflow, equipment layout and overall design can increase efficiency, sometimes reduce the power bill, and improve PUE. The key is to integrate various disciplines, including power distribution, mechanical air flow and knowledge of the application requirements, and to retrofit existing infrastructure rather than build out new data center capacity.
This was the message put forth to the Wikibon community on the April 6, 2010 Peer Incite Research Meeting. We were joined by Eugean Hacopians, Senior Systems Engineer at Caltech and Principal at Anre Technologies, a data center consultancy.
Hacopians shared his experiences at Caltech and with other operations in which he's helped improve energy efficiency. A key message of the call was that while vendors will often try to sell you the latest and greatest equipment, touting improved efficiencies, it's best to understand your environment and consider practical ways to reduce energy consumption.
For example, Wikibon member Josh Krischer of Josh Krischer Associates shared some metrics about PUE. PUE stands for Power Usage Effectiveness. It is the ratio of the total power required for the facility to the power required for the IT equipment. A hypothetical value of 1.0 is perfection and unattainable. Krischer's estimates indicate that fewer than 10% of data centers worldwide have a PUE below 1.5, and on average he sees PUEs of between 2 and 3. This means that the total data center demand, on average, is 2-3X the demand for IT equipment.
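As a quick illustration of the ratio described above, the sketch below computes PUE from two meter readings. The function name and the sample readings are hypothetical, chosen only to land in the 2-3 range Krischer cites; they are not measurements from the meeting.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power divided by IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power cannot be less than IT power")
    return total_facility_kw / it_equipment_kw


# Hypothetical readings: 150 kW at the utility meter, 60 kW measured at the IT racks.
example_pue = pue(150.0, 60.0)
overhead_kw = 150.0 - 60.0  # power going to cooling, distribution and other support gear
print(f"PUE = {example_pue:.2f}, overhead = {overhead_kw:.0f} kW")
```

In this made-up case every kilowatt delivered to IT gear carries another 1.5 kW of overhead, which is why the IT load itself is the first place to look for savings.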
The critical point for users is that replacing, for example, a CRAC unit with a more efficient unit misses a large portion of the problem, namely the IT equipment. Rather, if practitioners can find ways to lower the consumption of IT gear, it will have a ripple effect on infrastructure (i.e., power distribution and cooling).
Hacopians indicated that practitioners have three choices when they are out of power, cooling or space:
- Buy more dense equipment ("densify")
- Try to get more power
- Buy more CRAC units
In the case of Caltech, by following a straightforward recipe, the organization was able to completely shut off two of its nine CRAC units. Hacopians and Krischer recommended some practical measures, which won't apply in all cases, including:
- Virtualize servers, storage and networking
- Use denser equipment
- Set up hot and cold aisles
- Avoid hot spots
- Improve air flow beneath raised floors
- Reduce the amount of air that must be cooled
- Plan your rack placement and avoid obstructions as you scale (e.g. columns)
The Human Touch
Hacopians stressed that most managers don't see the problem. Krischer shared some IBM statistics indicating that worldwide, fewer than 25% of data center managers have any control or authority over the power bill. Wikibon members in the U.S. indicate the figure is even lower (i.e., less than 10%).
Lack of metering and monitoring systems within the data center makes tracking power consumption very difficult. Compounding the problem, installing such systems is disruptive, so it's always pushed to the back burner.
Hacopians recommends incentive systems that entice data center managers and employees to find ways to reduce energy consumption. In the case of Caltech, this led to simple and inexpensive fixes.
For example, Caltech's initiative to reduce energy consumption cost the organization $20,000 excluding staff time. The resulting savings amounted to around $2,000 per month, according to Eugean. Perhaps more important, the organization has quadrupled its CPU power from 36 to 140 cores and doubled its storage capacity from 600TB to 1,200TB, while holding power consumption steady.
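The breakeven implied by those two figures is easy to check. The sketch below is a simple, undiscounted payback calculation using the $20,000 cost and roughly $2,000 monthly savings quoted above; the function name is illustrative, and staff time and the value of the added capacity are deliberately left out, as they are in the quoted cost.

```python
def simple_payback_months(upfront_cost: float, monthly_savings: float) -> float:
    """Months until cumulative savings equal the upfront cost (no discounting)."""
    if monthly_savings <= 0:
        raise ValueError("monthly savings must be positive")
    return upfront_cost / monthly_savings


# Figures quoted in the article: $20,000 spent, about $2,000 saved per month.
payback = simple_payback_months(20_000.0, 2_000.0)
first_year_net = 12 * 2_000.0 - 20_000.0  # savings left over after the first 12 months
print(f"Payback: {payback:.0f} months, first-year net: ${first_year_net:,.0f}")
```

On these numbers the project pays for itself in under a year, before counting the extra cores and storage.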
The bottom line is that by thinking about what's possible before investing too heavily, and viewing the data center as a "living organism," CIOs can improve ROI, shorten breakeven periods and hold the power bill steady.
Action item: Practitioners are often under the gun to address power pain points. CIOs should avoid the easy fix of installing more energy-efficient CRAC and power distribution units without first going through the exercise of trying to reduce the energy consumed by IT equipment. Lowering the consumption of IT gear will have positive ripple effects throughout the data center infrastructure.
Footnotes: Josh Krischer Data Center Efficiency Report
Data centers today are like a living organism. They are a complex mix of many elements that either work together or against each other if the correct balance is not engineered. Like living organisms, every data center is different. Two data centers may have the same square footage, power, air conditioning, and number of racks, but different business requirements mean they need different approaches to rack layout, rack density, power distribution and air flow.
A critical question to consider when you have enough headroom for growth is how to configure the data center to maximize longevity while allowing for ease of expansion. In this situation an organization can take the time to devise a long-range, broad-scope plan that takes all of the elements above into consideration.
The challenge most often lies in accommodating rapid data center growth while utilizing the same power and air conditioning resources. This is a difficult challenge to solve for many reasons, especially if there is not enough time to develop a good plan. This is a classic case of reactive response rather than proactive planning and is usually brought on by unexpected growth that is sprung on managers of data centers. Some “rule of thumb” strategies that help to avoid or delay the need to enhance power and air conditioning resources and data center size are:
1) Refresh IT equipment
1.1) 4X to 12X performance improvements in the same footprint
2) Consolidation through virtualization
2.1) Virtualization can supply a 5X boost in utilization in the same footprint
2.2) If your business model can take advantage of virtualization, do so. Keep in mind that virtualization does not work for all situations.
2.3) Most enterprise servers are underutilized
3.1) 2X improvements by carefully choosing the right server
4) Think out of the box
4.1) Assess your current situation
4.2) Identify weak points
4.3) Understand where to take advantage of different strategies
4.4) Act upon those improvement strategies
5) Take an innovative approach to data center design and operation
5.1) Think of ways to make the facility last longer
No single approach works for all data centers. The question I am asked most frequently is, “What area of the data center should I attack first?” My answer: “It really depends on the problem areas.” A good return on investment can be gained from applying the “rule of thumb” approaches above. However, sometimes these approaches do not solve the most serious problem that needs to be attacked first. IT professionals usually know which two or three problems in the data center keep them up at night, if they think about data center issues at all. Most often those are the areas that need attention first.
In the situation at Caltech the data center was running at 72F to 76F, a little warm. There was a need to add more processing power and storage to sustain our data processing operations. Caltech facilities suggested adding more computer room air conditioning (CRAC) units. Adding 18 tons of CRAC did not help as much as expected. At that time we also had no chance of getting more power or air conditioning to the data center.
We started researching why our 60KVA load of servers and storage was not being adequately cooled by 44 tons of air conditioning. We also needed to continue to expand our processors and storage. There seemed to be no clear and easy answer regarding how to avoid overwhelming our power and air conditioning resources.
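A rough conversion shows why those numbers looked wrong. The sketch below turns the 60KVA load into tons of cooling using the standard definitions (one ton of refrigeration is 12,000 BTU/h, and one watt is about 3.412 BTU/h). It assumes a power factor of 1.0, so every KVA is treated as a kilowatt of heat, and it ignores lighting, people and heat gain through the building; both are simplifying assumptions, not figures from Caltech.

```python
BTU_PER_HOUR_PER_TON = 12_000.0    # definition of one ton of refrigeration
BTU_PER_HOUR_PER_WATT = 3.412142   # standard watt-to-BTU/h conversion


def cooling_tons_for_load(load_kw: float) -> float:
    """Tons of cooling needed to remove load_kw of heat, with no safety margin."""
    return load_kw * 1000.0 * BTU_PER_HOUR_PER_WATT / BTU_PER_HOUR_PER_TON


ASSUMED_POWER_FACTOR = 1.0                      # assumption: worst case, all KVA becomes heat
it_load_kw = 60.0 * ASSUMED_POWER_FACTOR        # the 60KVA server and storage load
tons_needed = cooling_tons_for_load(it_load_kw)  # roughly 17 tons
installed_tons = 44.0                            # air conditioning capacity cited above
oversupply = installed_tons / tons_needed        # roughly 2.6x the IT heat load

print(f"Needed: {tons_needed:.1f} tons, installed: {installed_tons:.0f} tons "
      f"({oversupply:.1f}x)")
```

Under these assumptions the installed capacity was more than twice the IT heat load, which points toward how the air was being delivered rather than how much cooling was available.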
After assessing the entire facility and project needs, we started making small changes in many different areas:
- Upgrading older servers with new multi-core servers.
- Consolidating multiple services into fewer servers. In our case virtualization didn’t work well so we ended up running multiple processes on a larger server.
- Upgrading storage servers with dense, power efficient drive solutions with massive array of idle disks (MAID) architecture from NEXSAN Technologies.
- Utilizing hot aisle containment to improve air conditioning and airflow.
Utilizing these four methods we quadrupled our processing power from 36 to 140 cores and doubled storage capacity from 600TB to 1.2PB while reducing the power load by 3KVA! We were able to eliminate 2 CRAC units completely and reduce our air conditioning need to 16 tons while lowering the cold aisle temperature to 61F and venting the hot aisle at 78F.
The 18 tons of CRAC that was added is now used only as a backup. In an emergency the data center can operate on as little as 8 tons of AC while maintaining 78F cold aisle and 90F hot aisle temperatures. If the hot aisle temperature exceeds 85F, a system of fans evacuates hot air into the ceiling plenum. When the plenum is pressurized with this hot air, the air is pushed out through a rooftop CRAC unit’s fresh air duct.
Ironically, while I was on Wikibon’s Peer Incite call I got an email stating that the AC maintenance people were here and wanted to take the primary AC unit down for maintenance. They took the primary unit down for 30 minutes, and the temperature didn’t go above 63F in the cold aisle or 82F in the hot aisle. The outage was apparently not long enough for the backup AC to kick in.
Action item: The approach to solving any data center problem involves half science and half art. There is no “silver bullet”. Avoid falling for expensive sales pitches and “cookie cutter” solutions. Be creative and think outside the box. Assess your specific situation, identify the main weak points and remediate those while keeping the big picture in mind.
Despite a decade or more of green-IT initiatives, data center power and cooling challenges are increasing. Server virtualization and storage consolidation may have reset the baseline for the number of installed servers and storage systems, but application growth continues, utilization rates for both processors and storage are up, and the total power used, inclusive of cooling systems and power distribution systems, can be 2-3X what is required of the IT hardware alone. This results in increased power and cooling requirements, when measured on a cost-per-cubic-foot basis. Ultimately CIOs must face the very real threat of data centers being out of available power and out of available cooling.
The options, when faced with out-of-cooling and out-of-power conditions, are numerous and include:
- Build new data centers
- Re-engineer power and cooling systems
- Refresh hardware to take advantage of more energy-efficient systems
- Expand the scope of server-virtualization and storage-consolidation initiatives
- Employ job scheduling to reduce peak load
Many organizations lack three things necessary to make informed, financially responsible decisions:
- Information
- Structure
- Process
The information gap stems from an absence of good measurement. Measurement needs to encompass not only the consumption of power by IT systems, but also the consumption by power distribution systems, battery backup systems, and cooling. The facilities manager often measures power going into the data center complex, but they frequently do not charge CIOs for the power and cooling consumed.
The structural challenge stems from the fact that often the facilities manager holds the budget for data center buildings, power systems, backup generators, battery backup systems, and HVAC, while the CIO holds the budget for IT systems and software. CIOs therefore lack the full budget responsibility for which they could be held accountable. If the costs of buildings, power, and cooling are in another budget, one of the options for a CIO is to make power or cooling someone else's problem.
Process can overcome some organizational structural challenges. If budgets for facilities, power, and cooling are separate from IT budgets and not allocated to the data center, the process solution is to rest the responsibility in the lap of the CFO, to whom the facilities manager and the CIO often report. The CFO can ensure, before building new data centers, investing in new HVAC systems, or upgrading power systems, that the CIO has examined all options relative to data center design and upgrades to more energy-efficient systems.
Action Item: Ultimately the responsibility for making sound financial decisions rests with the CFO. The CFO should establish three priorities in order to drive better overall data center efficiency:
- Implement measurement and chargeback for facilities, power, and cooling
- Reorganize budgets to empower CIOs and hold them responsible for the full operational cost of IT systems
- Establish processes for better collaboration between Facilities Management and Data Center Managers to ensure that IT options are evaluated in the context of power and cooling efficiency.
As any savvy CIO, data center (DC) manager or vendor knows, the increase in and consolidation of compute power, along with the growth of additional IT assets populating the DC, has created a major spike in power demand. CFOs and facilities managers need only look at their electric bills for verification. Generally accepted industry estimates for DC power consumption conclude that more than 50% of the bill is dedicated to the cooling infrastructure.
Meanwhile, ASHRAE (the American Society of Heating, Refrigerating and Air-Conditioning Engineers), the world governing body for air-conditioning practice, published data suggesting that air-cooled mainframes and racks have heat loads in the range of 500 – 1500 W/ft2. With the advent of new thin server designs just 1 ¾” high, a typical 84-inch cabinet can hold up to 40 servers, and the heat load per cabinet could be as high as 10,000W.
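To put the cabinet figure in perspective, the short sketch below works backward from the numbers in the paragraph above: 40 servers, 1 ¾ inches each, and a 10,000W cabinet. The per-server wattage is simply what those figures imply, not a measured value, and the conversion to tons of cooling uses the standard 12,000 BTU/h per ton.

```python
RACK_UNIT_INCHES = 1.75                 # height of one thin server, per the text
WATTS_PER_TON = 12_000.0 / 3.412142     # one ton of cooling is 12,000 BTU/h; 1 W is 3.412142 BTU/h

servers_per_cabinet = 40
cabinet_heat_w = 10_000.0

watts_per_server = cabinet_heat_w / servers_per_cabinet       # implied draw per server
occupied_height_in = servers_per_cabinet * RACK_UNIT_INCHES   # inches used of the 84-inch cabinet
cooling_tons_per_cabinet = cabinet_heat_w / WATTS_PER_TON     # cooling needed for one full cabinet

print(f"{watts_per_server:.0f} W per server, {occupied_height_in:.0f} in. occupied, "
      f"{cooling_tons_per_cabinet:.1f} tons of cooling per cabinet")
```

A single fully loaded cabinet at this density needs nearly three tons of cooling on its own, which is why dense racks make airflow and hot spots a first-order concern.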
Greening the DC
Enterprise IT vendors such as Dell, HP and IBM have made major strides in reducing server and storage power consumption with help from Intel and AMD. IT vendors have also developed partnerships with CRAC (Computer Room Air Conditioning) manufacturers such as APC and Liebert. In addition, they are deploying IT asset management software in conjunction with facilities management tools, as well as virtualization technologies such as VMware, to assist in monitoring, regulating and ultimately lowering power consumption in all corners of the DC.
The DC as Holistic Organism
Needless to say, the calibration of so many different components that cross over multiple cost centers, including IT and facilities, is no easy task. DC vendors are learning how to bridge the gap between cost centers and navigate the many persistent political mine fields. However, IT is often its own worst enemy when it doesn’t view DC costs in a comprehensive or holistic way. The same can be said of vendors who turn a deaf ear to their clients’ unique requirements instead of helping customers determine the best solution for their needs.
As Wikibon contributor Eugean Hacopians of Caltech detailed for participants during his Peer Incite presentation on April 6, “Every computer room is different.” He suggests that each DC is a living organism that needs to be assessed individually. “By being creative lots of unneeded equipment can be eliminated, often inexpensively with excellent ROI.” Eugean also gave practical, hands-on advice on how to improve the longevity of DC assets. Unfortunately he also bemoaned the fact that “too many vendors don’t listen to your problem and just try to sell their solution. If vendors don’t have a stake in the outcome they don’t take it seriously. This comes through loud and clear to a buyer. It’s like saying ‘If I can’t make good money from you I’m not interested.’ Vendors are better off listening, learning, offering suggestions and if there’s not a profitable relationship to be had then move on. But don’t advertise the fact that you don’t care.”
Data center technologies and best practices continue to evolve as improved energy efficient IT assets come to market and facilities requirements rapidly change. However, customer internal politics and vendor sales practices are slower to change. IT and facilities management will eventually yield to the higher logic of improved ROI and lower operating costs. Greener computing is an additional benefit that can be measured well beyond the walls of the DC. It is clear that some vendors have modified their messages and go to market strategies while others still lag behind.
Action item: Vendors need to continue to develop and roll out data center solutions and strategies that view the DC as a living, evolving, holistic entity. In addition, vendors must offer customers useful ROI tools and metrics that help both IT and facilities management justify combining DC budgets wherever it makes sense. Vendors should also provide the maximum incentive for bridging political differences while recognizing and attempting to meet each client's unique requirements.
Increased sensitivity to data center energy efficiency, continued volatility in energy prices and economic pressures keep data center costs and their environmental footprint under scrutiny. In addition, as heat densities for IT equipment increase, more and more legacy data centers are up against the threshold of energy capacity and floorspace. So what do these realities mean to the CIO?
- The CIO in most organizations is not the owner of data center energy consumption, costs, planning, or strategies.
- The CIO is the buyer or the broker of IT services for the application owners and the information managers.
- Data center inefficiencies pass from the CTO to the CIO and ultimately to the application owners and business unit executives who foot the bill.
- Ultimately those data center inefficiencies impact product costs and squeeze margins.
It's a simple equation with what seems to be an obvious conclusion -- data center costs have a direct impact on enterprise competitive stance. But in too many organizations, data center power consumption is not metered separately, creating a disconnect. After all, the CTO can’t improve what isn't being measured. And if you can’t measure, you can’t pass the improvements to the CIO, application owners, the businesses, and ultimately the customers.
Give a Little, Get A Lot
Measurement is resisted because it sometimes requires downtime and other interruptions that inconvenience end-users and impact the business. But for CIOs and their end-users, measurements can have big payoffs. Give a little, get a lot. Part of the challenge for the CIO in getting application heads to accept the downtime and other costs associated with measurement is getting everyone to understand the standards, the milestones, and the payoffs: reducing power consumption, increasing data center efficiency, and in the end increasing the competitiveness of the business.
Action item: CIOs need to get a handle on data center power consumption before it becomes a critical issue to application owners and business managers. Start by getting a copy of the energy bill, estimating the percent consumption by application, and forecasting consumption given database, application, and end-user growth estimates. Work with your CTO to develop plans and practices to manage energy thresholds at the application level on an ongoing basis, with the goal of passing these efficiencies on to the business managers through improved margins and higher customer satisfaction.
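One way to start on that estimate, before any per-rack metering exists, is a simple proportional split of the bill. The sketch below is only an illustration: the application names, power estimates, bill amount and growth rate are all hypothetical, and the forecast assumes energy use grows at the same rate as the workload, which will not hold exactly in practice.

```python
from typing import Dict


def allocate_bill(monthly_bill: float, app_power_kw: Dict[str, float]) -> Dict[str, float]:
    """Split a monthly energy bill across applications in proportion to estimated draw."""
    total_kw = sum(app_power_kw.values())
    if total_kw <= 0:
        raise ValueError("estimated power draw must be positive")
    return {app: monthly_bill * kw / total_kw for app, kw in app_power_kw.items()}


def forecast_cost(current_cost: float, annual_growth: float, years: int) -> float:
    """Project a cost forward, assuming energy use tracks workload growth."""
    return current_cost * (1.0 + annual_growth) ** years


# Hypothetical inputs: a $10,000 monthly bill and rough per-application estimates in kW.
estimates_kw = {"erp": 30.0, "email": 10.0, "analytics": 20.0}
shares = allocate_bill(10_000.0, estimates_kw)
erp_in_three_years = forecast_cost(shares["erp"], annual_growth=0.20, years=3)

for app, cost in shares.items():
    print(f"{app}: ${cost:,.2f} per month")
```

Even a rough split like this gives application owners a number to react to, and it can be replaced with metered data as monitoring is installed.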
By focusing on improving the efficiency of IT equipment and looking at the data center as a "holistic organism," Caltech was able to eliminate two CRAC units from the data center. Now operating on one 16-ton, chilled-water CRAC, the organization was able to re-deploy an existing CRAC infrastructure, an 18-ton unit, as backup.
Eight tons of the 18-ton unit is a portable DX unit and is set up to switch on automatically if the temperature goes above 70 degrees Fahrenheit in the cold aisle. If the temp in the hot aisle goes above 85 degrees Fahrenheit, a system kicks in to route the hot air out of the building using the rooftop CRAC unit’s fresh air duct.
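The two triggers just described amount to a pair of threshold checks. The sketch below models only that decision logic, using the 70F and 85F thresholds from the text; the names are hypothetical, and a real controller would also need sensor filtering and hysteresis, which the article does not describe.

```python
from dataclasses import dataclass

COLD_AISLE_BACKUP_F = 70.0   # backup DX unit switches on above this cold aisle temperature
HOT_AISLE_EXHAUST_F = 85.0   # hot air is routed out of the building above this hot aisle temperature


@dataclass(frozen=True)
class CoolingActions:
    backup_dx_on: bool
    exhaust_fans_on: bool


def decide(cold_aisle_f: float, hot_aisle_f: float) -> CoolingActions:
    """Return which emergency measures the described thresholds would trigger."""
    return CoolingActions(
        backup_dx_on=cold_aisle_f > COLD_AISLE_BACKUP_F,
        exhaust_fans_on=hot_aisle_f > HOT_AISLE_EXHAUST_F,
    )


# Normal operation at the 61F cold aisle and 78F hot aisle reported for Caltech.
print(decide(61.0, 78.0))
# A hypothetical prolonged outage that pushes both aisles past their thresholds.
print(decide(78.0, 90.0))
```

With the cold aisle normally running well below 70F, there is a wide margin before either measure is needed.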
By focusing on improving the efficiency of IT equipment, Caltech saw ripple effects on the cooling infrastructure-- i.e. less IT gear to cool means lower cooling requirements. All too often, organizations buy into vendor sales pitches that existing cooling infrastructure is inefficient and needs to be upgraded. But solely focusing on cooling and power infrastructure misses the most important part of the problem. Practitioners should develop efficiency strategies understanding the opportunities to improve IT equipment which will in turn lower cooling and power demands.
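The ripple effect can be approximated with the PUE ratio. The sketch below assumes, as a simplification, that supporting overhead scales in proportion to IT load, so PUE stays constant as IT load drops; in practice some overhead is fixed and the real saving will differ. The PUE value is a hypothetical one at the low end of the 2-3 range quoted earlier, and the 3KVA reduction reported for Caltech is treated as 3 kW.

```python
def facility_savings_kw(it_savings_kw: float, pue: float) -> float:
    """Facility-level power reduction for a given IT reduction, assuming constant PUE."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_savings_kw * pue


ASSUMED_PUE = 2.0        # hypothetical; not a measured Caltech value
it_reduction_kw = 3.0    # the 3KVA reduction, treated as 3 kW (power factor of 1.0 assumed)

total_reduction_kw = facility_savings_kw(it_reduction_kw, ASSUMED_PUE)
infrastructure_share_kw = total_reduction_kw - it_reduction_kw  # saved in cooling and distribution

print(f"IT: {it_reduction_kw:.1f} kW, facility total: {total_reduction_kw:.1f} kW")
```

Under these assumptions each kilowatt removed from the racks takes roughly another kilowatt out of cooling and power distribution.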
There are many ways to deal with IT equipment inefficiencies but no one silver bullet. Techniques include but are not limited to:
- Virtualizing servers and storage,
- "Densifying" racks,
- Refreshing technology using more energy efficient equipment,
- Redesigning rack and equipment,
- Optimizing workload distribution,
- Making air flow work better,
- Creating hot/cold aisles and containing hot and cold air.
All these factors can improve PUE and get rid of inefficient equipment and supporting infrastructure.
Every computer room is different. Eugean Hacopians of Caltech suggests that an IT center is a living organism that needs to be assessed individually. By being creative lots of unneeded equipment can be eliminated, often inexpensively with excellent ROI.
Action item: By focusing on power and cooling infrastructure only, practitioners will miss the most important part of the IT equation-- IT equipment. Organizations need to look at data centers as a type of living, breathing organism and identify the areas that will give the most bang for the buck. This means starting with IT equipment which will in turn reduce requirements for power and cooling infrastructure.