Data centers today are like living organisms. They are a complex mix of many elements that work together when the balance is engineered correctly and against each other when it is not. Like living organisms, every data center is different. Two data centers may have the same square footage, power, air conditioning, and number of racks, yet different business requirements call for different approaches to rack layout, rack density, power distribution, and airflow.
A critical question to consider when you have enough headroom for growth is how to configure the facility to maximize longevity while allowing for ease of expansion. In this situation an organization can take the time to devise a long-range, broad-scope plan that takes all of the elements above into consideration.
The challenge most often lies in accommodating rapid data center growth while utilizing the same power and air conditioning resources. This is a difficult challenge to solve for many reasons, especially if there is not enough time to develop a good plan. This is a classic case of reactive response rather than proactive planning and is usually brought on by unexpected growth that is sprung on managers of data centers. Some “rule of thumb” strategies that help to avoid or delay the need to enhance power and air conditioning resources and data center size are:
1) Refresh IT equipment
1.1) 4X to 12X performance improvements in the same footprint
2) Consolidation through virtualization
2.1) Virtualization can supply a 5X boost in utilization in the same footprint
2.2) If your business model can take advantage of virtualization, do so. Keep in mind that virtualization does not work for all situations.
2.3) Most enterprise servers are underutilized
3) Densification
3.1) 2X improvements by carefully choosing the right server
4) Think out of the box
4.1) Assess your current situation
4.2) Identify weak points
4.3) Understand where to take advantage of different strategies
4.4) Act upon those improvement strategies
5) Take an innovative approach to data center design and operation
5.1) Think of ways to make the facility last longer
No single approach works for all data centers. The question I get most often is, “What area of the data center should I attack first?” My answer: “It really depends on the problem areas.” A good return on investment can be gained from applying the “rule of thumb” approaches above. However, sometimes these approaches do not solve the most serious problem, the one that needs to be attacked first. IT professionals usually know which two or three problems in the data center keep them up at night, if they think about data center issues at all. Most often those are the areas that need attention first.
In the situation at Caltech the data center was running at 72F to 76F, a little warm. There was a need to add more processing power and storage to sustain our data processing operations. Caltech facilities suggested adding more computer room air conditioning (CRAC) units. Adding 18 tons of CRAC did not help as much as expected. At that time we also had no chance of getting more power or air conditioning to the data center.
We started researching why our 60KVA load of servers and storage was not being adequately cooled by 44 tons of air conditioning. We also needed to continue to expand our processors and storage. There seemed to be no clear and easy answer regarding how to avoid overwhelming our power and air conditioning resources.
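A quick back-of-the-envelope conversion shows why this was puzzling. The sketch below is illustrative only: it assumes a power factor of 1.0 (a worst case in which every kVA ends up as a kW of heat) and uses the standard conversion of 1 ton of refrigeration = 12,000 BTU/h, or about 3.517 kW. The function name and the `power_factor` parameter are made up for this example.

```python
# Rough check: how many tons of cooling does an IT load need on paper?
KW_PER_TON = 3.517  # 1 ton of refrigeration = 12,000 BTU/h, about 3.517 kW


def cooling_tons_needed(load_kva, power_factor=1.0):
    """Estimate cooling tons for an IT load, assuming all real power becomes heat.

    power_factor=1.0 is an assumed worst case, not a measured value.
    """
    heat_kw = load_kva * power_factor
    return heat_kw / KW_PER_TON


needed = cooling_tons_needed(60)  # the 60KVA load of servers and storage
installed = 44                    # tons of air conditioning in the room
print(f"needed ~{needed:.1f} tons, installed {installed} tons, "
      f"ratio {installed / needed:.1f}x")
```

Under these assumptions the load needs roughly 17 tons, so 44 tons is more than 2.5 times the nominal requirement. That suggests the shortfall was in how the air was delivered and returned rather than in raw capacity, although this estimate ignores lighting, people, and heat gain through the building.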
After assessing the entire facility and project needs, we started making small changes in many different areas:
- Upgrading older servers with new multi-core servers.
- Consolidating multiple services into fewer servers. In our case virtualization didn’t work well so we ended up running multiple processes on a larger server.
- Upgrading storage servers with dense, power efficient drive solutions with massive array of idle disks (MAID) architecture from NEXSAN Technologies.
- Utilizing hot aisle containment to improve air conditioning and airflow.
Utilizing these four methods we quadrupled our processing power from 36 to 140 cores and doubled storage capacity from 600TB to 1.2PB while reducing the power load by 3KVA! We were able to eliminate 2 CRAC units completely and reduce our air conditioning need to 16 tons while lowering the cold aisle temperature to 61F and venting the hot aisle at 78F.
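One way to put these numbers on a common footing is to express them per KVA of load. The sketch below is a rough illustration, assuming the 3KVA reduction is measured against the 60KVA load mentioned earlier and that 1.2PB means 1,200TB; the dictionary keys and helper names are hypothetical.

```python
# Before/after figures quoted in the text.
# Assumptions: baseline load is the 60KVA figure, and 1.2PB = 1200TB.
before = {"cores": 36, "storage_tb": 600, "load_kva": 60}
after = {"cores": 140, "storage_tb": 1200, "load_kva": 60 - 3}


def per_kva(stats, key):
    """Density of a resource per KVA of electrical load."""
    return stats[key] / stats["load_kva"]


def improvement(key):
    """Ratio of after-density to before-density for one resource."""
    return per_kva(after, key) / per_kva(before, key)


for key in ("cores", "storage_tb"):
    print(f"{key}: {per_kva(before, key):.2f} -> {per_kva(after, key):.2f} "
          f"per KVA ({improvement(key):.2f}x)")
```

On these assumptions, cores per KVA improve by a little over 4X and storage per KVA by a little over 2X, slightly better than the raw counts suggest because the load itself went down.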
The 18 tons of CRAC that was added is now used only as a backup. In an emergency the data center can operate on as little as 8 tons of AC while maintaining 78F cold aisle and 90F hot aisle temperatures. If the hot aisle temperature exceeds 85F, a system of fans evacuates hot air into the ceiling plenum. When the plenum is pressurized with this hot air, the air is pushed out through a rooftop CRAC unit’s fresh air duct.
Ironically, while I was on the Wikibon Peer Incite call I got an email stating that the AC maintenance people were here and wanted to take the primary AC units down for maintenance. They took the primary unit down for 30 minutes, and the temperature did not rise above 63F in the cold aisle or 82F in the hot aisle. The outage was apparently not long enough for the backup AC to kick in.
Action Item: The approach to solving any data center problem is half science and half art. There is no “silver bullet”. Avoid falling for expensive, sales-pitch-driven “cookie cutter” solutions. Be creative and think outside the box. Assess your specific situation, identify the main weak points, and remediate those while keeping the big picture in mind.