Rapid increases in system size, combined with the popularity of multi-core processors, are placing increased pressure on interconnect infrastructure to deliver higher bandwidth and lower latency. This trend is particularly acute in high-performance computing (HPC) Top500 use cases, where InfiniBand (IB) is the performance leader. Figure 1 shows that InfiniBand has the highest performance share of interconnect fabrics in the Top500 supercomputers.
InfiniBand is poised to increase penetration not only in traditional HPC environments but also in HPC applications in life sciences, financial services, and more mainstream commercial industries. A key recent advance in cost-effectively scaling large InfiniBand infrastructure is to optimize the switch configuration by allowing a more flexible allocation of switch ports through high-density port implementations on director-class IB switches. We have dubbed this technique Optimized InfiniBand (OIB); in practice it provides the same effective bandwidth as traditional switches at substantially reduced cost.
To date, only QLogic is delivering Optimized InfiniBand switches, although our understanding is that Mellanox and Voltaire (acquired by Mellanox in November 2010) are actively investigating the approach. Our research shows that for most HPC workloads requiring large InfiniBand port configurations, traditional switch configurations that are not optimized consume 33% more IB ports than Optimized InfiniBand switches. This significantly impacts the total cost of ownership.
This study quantifies the economic impact of Optimized InfiniBand switches as compared to traditional IB switch infrastructure and provides an economic analysis to assist users and OEMs in understanding the potential of this technology.
InfiniBand itself was not initially focused on supporting the needs of the HPC market. The OpenFabrics Alliance addressed part of this gap with the OFED software stack. However, further extensions are required to adapt InfiniBand to the very high-speed multi-core processors that now populate the server racks at the heart of supercomputers. One example is TrueScale from QLogic, which is built on InfiniBand as its physical and transport layers. On top of that, TrueScale provides a highly scalable host-based architecture that offers:
- Connectionless, stateless design,
- Congestion detection and avoidance,
- Quality of Service provided for traffic classes,
- Support for tree, mesh, torus, and hybrid topologies.
The main advantage of these extensions is scalability: they make the fabric efficient and avoid the interconnect bottlenecks that emerge with basic InfiniBand support. Many of these architectural extensions are subtle and do not come into play until large numbers of nodes are directed at a problem at scale.
Host-Based vs. Adapter-Based High Performance Processing
The traditional method of designing HPC InfiniBand networks was the adapter-based design, where each InfiniBand host channel adapter (HCA) includes embedded processing that runs the communications protocols. In the days of single- or dual-core processors this made sense, as processing the communication protocols in the server would impact server performance.
However, with the advent of modern multi-core processors, the HCAs themselves become the bottleneck for most applications. For example, each core of an Intel Xeon 5500 (Nehalem) processor can issue four instructions per clock cycle and operates at a clock speed of 3 GHz. As a result, a Nehalem processor has an execution rate that can approach 24 times that of the generic processing engine in today's offload adapters. This difference in processing power can overload the adapter-based/offload processor, making it a bottleneck for the host and for HPC cluster performance. Looking forward to the next generation of Intel and AMD processors, core counts will increase dramatically over the next few years; for example, processors based on the Intel Westmere chip will allow 72 times more processing than a current HCA.
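As a rough sketch of that arithmetic, the snippet below compares the peak instruction throughput of a host core with that of an offload-adapter engine. The Nehalem figures (four instructions per cycle at 3 GHz) come from the text; the offload-engine clock speed and issue width are assumptions chosen only to illustrate the roughly 24x ratio cited above.

```python
# Rough instruction-throughput comparison (illustrative only).
# Host-core figures are from the text; the offload-engine figures are
# assumptions chosen to illustrate the ~24x ratio cited above.

def instr_per_second(clock_hz: float, instr_per_cycle: int) -> float:
    """Peak instruction rate = clock frequency x issue width."""
    return clock_hz * instr_per_cycle

# Intel Xeon 5500 (Nehalem) core: 4 instructions/cycle at 3 GHz (from the text).
nehalem_core = instr_per_second(3.0e9, 4)

# Hypothetical embedded engine on an offload HCA: single-issue at 500 MHz (assumed).
offload_engine = instr_per_second(0.5e9, 1)

print(f"Nehalem core:   {nehalem_core / 1e9:.0f} G instructions/s")
print(f"Offload engine: {offload_engine / 1e9:.1f} G instructions/s")
print(f"Ratio:          ~{nehalem_core / offload_engine:.0f}x")  # ~24x
```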
The solution that best-of-breed InfiniBand architectures such as TrueScale employ is to execute the communications protocol in the server using a “lightweight” protocol that minimizes the memory and CPU cycle overhead on the server.
Thus a Nehalem-based server using a host-based design can process approximately five times more messages than the same server with a traditional HCA-offload design. As the number of nodes in an HPC cluster increases, the message rate per server increases. Wikibon believes that for the vast majority of large-scale HPC applications, a host-based design will become the de facto standard over the next two years.
Switching of InfiniBand connections in large supercomputer environments is now at the leading edge of communication technology. One of the innovations of recent years has been greater flexibility in how switches are connected together. In the previous generation of InfiniBand switches, all vendors used 24-port double data rate (DDR) switch silicon. With that technology, the traditional method of building leaf modules was to use half of the ports as external ports and half for inter-switch links (ISLs) within the chassis. The current generation of IB switches uses quad data rate (QDR) silicon with 36 ports, which implies an 18/18 port split.
A recent innovation in building large supercomputer environments is to change the way these ports are allocated. For most applications, the bandwidth to the external points (i.e., the supercomputer servers) is not fully utilized, which means that the ISLs are not fully utilized either. A leading innovation is to split the switch with 24 ports for external connections and 12 ports for ISLs. This Optimized IB switch configuration provides the same effective latency at a much reduced cost. The only provider of Optimized InfiniBand technology at the moment is QLogic, although it appears that both Mellanox and Voltaire are studying the approach.
The same fundamental 36-port building block used in leaf switches is also used to build core switches. Because the bandwidth to and from the servers is not fully utilized for most applications, QLogic applies the same 24-external/12-ISL Optimized IB split to its core switches, providing the same effective bandwidth and latency at a much reduced cost. Using this technique, the traditional 648-port core switch has been stretched to an Optimized 864 ports, which allows fewer core switches to be deployed and much less interconnect between the edge and core switches, lowering costs at scale; the sketch below illustrates the port arithmetic.
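A minimal sketch of that arithmetic, assuming a design built from 36-port switch ASICs and a director chassis with 36 leaf modules (the 18/18 and 24/12 splits are from the text; the chassis layout is an assumption that reproduces the 648- and 864-port figures):

```python
# Port arithmetic for traditional vs. Optimized IB switch configurations.
# Assumes 36-port switch ASICs and a director chassis with 36 leaf modules;
# the 18/18 and 24/12 splits are from the text.

ASIC_PORTS = 36
LEAF_MODULES_PER_DIRECTOR = 36  # assumed chassis layout

def describe(name: str, external_ports: int) -> None:
    isl_ports = ASIC_PORTS - external_ports
    oversubscription = external_ports / isl_ports
    director_ports = LEAF_MODULES_PER_DIRECTOR * external_ports
    print(f"{name}: {external_ports} external / {isl_ports} ISL ports per leaf "
          f"({oversubscription:.1f}:1 oversubscription); "
          f"director scales to {director_ports} ports")

describe("Traditional", 18)  # 648-port director, 1:1
describe("Optimized  ", 24)  # 864-port director, 2:1

# A traditional leaf consumes 36/18 = 2.0 switch ports per server port, while an
# Optimized leaf consumes 36/24 = 1.5, i.e. ~33% more ports for the traditional design.
print(f"Extra ports consumed by the traditional design: {(36 / 18) / (36 / 24) - 1:.0%}")
```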
This type of approach will work with most applications, although users should expect that a 50/50 port split will be more effective for some very bandwidth-sensitive applications.
Wikibon analyzed a 500-server configuration growing to 1,125 servers over three years. The total cost of ownership (TCO) analysis in the case study below concludes that the TCO for the traditional approach is $4.66M, about 37% higher than the Optimized InfiniBand approach at $3.40M. The traditional connectivity cost per server IB port per year is $1,960, compared with $1,431 for the Optimized IB technology. The Net Present Value (NPV) of using an Optimized approach versus a traditional solution is $1.12M.
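The headline figures follow from simple arithmetic. The sketch below uses the three-year TCO totals above, the 2,375 server IB port-years from the case-study tables, and the year-3 server count; it is a back-of-the-envelope check, not the Wikibon model itself.

```python
# Back-of-the-envelope check of the headline TCO figures (illustrative only).
# Inputs: 3-year TCO totals, 2,375 server IB port-years (from the case-study
# tables), and 1,125 servers in year 3.

traditional_tco = 4.66e6  # $, 3-year TCO, traditional IB switching
optimized_tco = 3.40e6    # $, 3-year TCO, Optimized IB switching
port_years = 2375         # server IB port-years over the 3-year period
servers_year3 = 1125

premium = traditional_tco / optimized_tco - 1
print(f"Traditional premium over OIB:  {premium:.0%}")                         # ~37%
print(f"Traditional cost/port/year:    ${traditional_tco / port_years:,.0f}")  # ~$1,960
print(f"Optimized cost/port/year:      ${optimized_tco / port_years:,.0f}")    # ~$1,431
print(f"Savings per server (3 years):  ${(traditional_tco - optimized_tco) / servers_year3:,.0f}")
# The last line gives roughly $1,120, in line with the ~$1,117 per-server saving cited later.
```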
Figure 2 shows the impact on the number of IB ports required to support the case study above.
As indicated, the only situation where this approach does not work is with workloads that will consistently exceed the bandwidth between the edge switches and core switches. Wikibon recommends that senior IT management and IB network managers investigate whether the majority of planned supercomputer workloads can take advantage of optimized IB configurations and include Optimized switch configurations as the default in RFPs.
InfiniBand is rapidly growing as the interconnect of choice for high-performance computing. In the HPC space, InfiniBand has claimed 36% of the market, second only to Gigabit Ethernet at 58%. IB adoption in the Top500 supercomputers has grown significantly faster than Ethernet's.
The use of InfiniBand in enterprise data centers has recently become more significant. In 2008 Oracle Corporation released the HP Oracle Database Machine (Exadata), which utilizes InfiniBand as the backend interconnect for all I/O and interconnect traffic. The updated Exadata Version 2 now uses Sun hardware and continues to rely on an InfiniBand infrastructure.
In 2009, IBM announced its DB2 pureScale offering, a shared-disk clustering scheme that uses a cluster of IBM System p servers communicating with each other over an InfiniBand interconnect.
In 2010, scale-out, high-performance network storage systems such as IBM SONAS, Isilon IQ, DataDirect Networks, and Terascala adopted InfiniBand as the primary storage interconnect.
Case Study of Total Cost of Ownership
In order to understand the economic impact of Optimized InfiniBand relative to older, conventional approaches, Wikibon built a financial model to compare the technologies in large HPC server environments. The case study scenario we used is a requirement to connect 500 servers using IB in year 1 and grow this to 1,125 servers in year 3. The assumptions are given in Table 1 below.
Table 2 shows a detailed TCO for a traditional IB switching infrastructure. The number of server IB port years to be switched is 2,375 and the cost per server port year is $1,960. The overall cost of the IB connectivity is $4.66M.
- Traditional approach requires more ports for ISL.
- Average cost per port is $1,960 per year.
- 37% more expensive than OIB.
Table 3 gives the TCO for Optimized IB switches. The number of server IB port-years to be switched (2,375) is the same as in Table 2. The overall total cost of ownership for the switches is $3.40M over three years, and the cost per server port per year is $1,431. Compared with Optimized IB switches, traditional IB switches cost 37% more.
- Same initial port requirement (2,375) as traditional IB switching.
- Average cost per port is $1,431 per year.
- OIB requires fewer ports for ISL.
- Most applications are excellent candidates for OIB, but some will be better served by traditional approaches.
The financial metrics for the case study are given in Table 4. The Net Present Value (NPV) of applying OIB relative to a traditional IB switching infrastructure is $1.12M. The first-year investment is $63K, the break-even point is 13 months, and the return on investment is very high (>600%), well above any financial hurdle rate.
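For readers unfamiliar with the metric, the sketch below shows how an NPV of this kind is typically computed: each year's net cash flow (savings minus investment) is discounted back to present value and summed. The discount rate and the year-by-year cash flows in the example are hypothetical placeholders, not figures from the Wikibon model.

```python
# Generic NPV calculation of the kind used to compare the two switching approaches.
# The discount rate and yearly net cash flows below are hypothetical placeholders,
# NOT figures from the Wikibon model.

def npv(rate: float, cash_flows: list[float]) -> float:
    """Discount each year's net cash flow to present value and sum.

    cash_flows[0] is the year-0 net flow (typically a negative investment).
    """
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))

# Hypothetical example: an up-front investment followed by three years of savings.
example_flows = [-100_000, 400_000, 450_000, 500_000]  # placeholder values
print(f"NPV at a 10% discount rate: ${npv(0.10, example_flows):,.0f}")
```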
Conclusions and Recommendations
Wikibon believes that Optimized IB switches are a significant improvement relative to traditional InfiniBand switch approaches, and we expect other vendors to follow QLogic’s lead in this area to avoid over-allocating bandwidth to each individual server.
In addition, Wikibon believes that host-based designs with lightweight messaging protocols will replace the traditional HCA-offload design, using modern multi-core servers to deliver a faster, lower-latency, higher-throughput way of designing HPC clusters. With increased processing power and scaling requirements in both HPC and non-traditional InfiniBand use cases, we expect OIB switches to become a staple of large computer network configurations.
Importantly, our research shows that most supercomputing workloads will work just as effectively on Optimized IB configurations, at a savings of $1,117 per server; money that could be better spent on enhancing other parts of the computing infrastructure. Optimized InfiniBand switches are a more efficient and cost-effective way of connecting large numbers of servers in most environments. Senior IT management and IB Network Managers should include Optimized IB switches as a default in their RFPs. The exception to this rule is workloads that will consistently exceed the bandwidth between edge switches and core switches.
Key Questions: The following seven questions should be asked when evaluating InfiniBand fabrics for your specific environment:
- Is the fundamental InfiniBand switching design HCA-offload based or host-based?
- What extensions are provided to the basic InfiniBand fabric?
- What topologies are supported by the IB fabric?
- What management, configuration and debugging tools are available and how well are they regarded by fellow users?
- Do the switch elements allow different configurations between the front-end and back-end ports to maximize utilization of the fabric?
- What is the highest port count and bandwidth on IB switches?
- What is the highest message count that can be achieved?