Josh Krischer, Josh Krischer & Associates
Economy in price and operation is the driving concept behind XIV. As a result, its designers focused on operational simplicity and the use of standard, off-the-shelf components to create a system that is not the absolute top end, but is good enough to support the vast majority of enterprise data storage needs. Traditional enterprise high-end storage architectures are either switched-matrix designs with host and device adapters, such as EMC's DMX and Hitachi's USP, or tightly coupled SMP designs, such as the IBM DS8000. Most traditional mid-range systems use dual controllers with mirrored cache, for example EMC's CLARiiON, Hitachi's AMS, HP's EVA, and Dell's EqualLogic PS series. In the last few years a new breed of block-based clustered storage system has been deployed, including 3PAR's InServ, LeftHand Networks' SAN/iQ, NEC's HYDRAstor, Sun's Fire X4500, and IBM's XIV.
The IBM XIV Storage System Model A14
In contrast, the IBM XIV is a cluster-based array of storage devices with 180 TBytes of raw capacity (82.5 TBytes usable), 24 ports of 4 Gbit Fibre Channel host connectivity, and six ports of 1 Gbit iSCSI connectivity. The system is built from SATA disks; however, the internal design and sophisticated cache algorithms allow performance comparable to high-end systems built with FC disks.
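Using the Rel. 2 module and drive counts listed below and the two-copy chunk scheme described under "Concept and data structure," a back-of-envelope calculation reproduces the quoted figures; the attribution of the remainder to spare space and metadata is our assumption, sketched here for illustration:

```python
# Back-of-envelope check of the quoted XIV capacities.
raw_tb = 15 * 12 * 1           # 15 modules x 12 drive slots x 1 TB SATA = 180 TB
after_mirroring = raw_tb / 2   # every chunk is stored twice -> 90 TB
usable_tb = 82.5               # usable figure quoted for the A14
reserve = after_mirroring - usable_tb  # ~7.5 TB, presumably sparing + metadata
print(f"raw={raw_tb} TB, mirrored={after_mirroring} TB, reserve~{reserve} TB")
```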
Rack Structure Rel.1 (the initial models of Nextra sold by XIV)
- Each rack: 3 UPS power supplies and 11 slots (e.g., 3 interface modules and 8 data modules as standard), plus 2 Ethernet switch modules (the interface modules provide the host connections)
- x86 industry-standard modules (modified Linux), 4 GB cache and 15 slots for 1 TB SATA drives
- 1 GbE within the rack, 10 GbE between racks (InfiniBand in future versions)
Rack Structure Rel.2 (IBM)
- Each rack: 3 UPS power supplies and 15 2U slots; the modules combine data and interface functions
- x86 industry-standard modules (modified Linux), 8 GB cache and 12 slots for 1 TB SATA drives
- 1 GbE within the rack, 10 GbE between racks (InfiniBand in future versions); the subsystem is designed to support up to 7 additional racks, for a raw capacity of up to 1,440 TBytes
- Multiple 1Gbps Ethernet connections between the modules
- Within each module, a 2x PCI-X interface at 8 GBps
Concept and data structure
The system is completely virtualized. The data is divided into 1 MByte chunks, which are spread across all disks in the system. To provide redundancy, each chunk is written twice, on different modules, which also ensures cache redundancy for write data. Each data module holds some “primary” and some “secondary” chunks. Adding new modules causes a proportional part of the data to be copied onto them; a module failure triggers the reverse action, and the “lost” chunks are recreated from their redundant copies on the remaining modules. These operations are completely automated and transparent to the users and the storage administrator. Spreading the data uniformly across all installed disks avoids the creation of “hot spots” and ensures quasi-constant response times, which is very important for interactive applications.
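A minimal Python sketch of this placement scheme follows; the pseudo-random selection and the rebalancing fraction are our assumptions for illustration, since XIV's actual distribution algorithm is not public:

```python
import random

def place_chunks(num_chunks, modules, seed=1):
    """Give every 1 MB chunk a primary and a secondary copy on two
    different modules, so a single module failure never loses both."""
    rng = random.Random(seed)
    placement = {}
    for chunk_id in range(num_chunks):
        primary, secondary = rng.sample(modules, 2)  # always distinct
        placement[chunk_id] = (primary, secondary)
    return placement

def add_module(placement, modules, new_module, seed=2):
    """Rebalance after an upgrade: move roughly a proportional share of
    copies onto the new module, keeping the two copies on distinct modules."""
    rng = random.Random(seed)
    modules.append(new_module)
    share = 1.0 / len(modules)  # fraction of copies the newcomer should hold
    for chunk_id, (primary, secondary) in placement.items():
        if rng.random() < share:
            placement[chunk_id] = (new_module, secondary)
        elif rng.random() < share:
            placement[chunk_id] = (primary, new_module)

modules = [f"module-{i}" for i in range(1, 16)]   # 15 modules, as in Rel. 2
placement = place_chunks(10_000, modules)
add_module(placement, modules, "module-16")       # transparent rebalance
```

A module failure is the same operation in reverse: every chunk that had a copy on the failed module gets a fresh copy re-created from its survivor on another module.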
The cache is managed in 4 KByte blocks, which is ideal for databases and interactive operations. A random read miss transfers 64 KBytes from disk to cache; for sequential access patterns, transfers grow to as much as 1 MByte.
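In code terms, the staging policy reads roughly like the sketch below; the doubling ramp between the two endpoints is our assumption, as only the 64 KByte and 1 MByte figures are stated:

```python
CACHE_BLOCK = 4 * 1024        # cache is managed in 4 KB blocks
RANDOM_STAGE = 64 * 1024      # a random read miss stages 64 KB
MAX_STAGE = 1024 * 1024       # sequential streams stage up to 1 MB

def stage_size(detected_sequential_bytes):
    """How much to read from disk into cache on a miss.
    `detected_sequential_bytes` is how much contiguous data the host
    has already read just before this miss (0 for a random access)."""
    if detected_sequential_bytes == 0:
        return RANDOM_STAGE
    size = RANDOM_STAGE
    while size < MAX_STAGE and size <= detected_sequential_bytes:
        size *= 2             # ramp the prefetch up for longer streams
    return min(size, MAX_STAGE)

assert stage_size(0) == 64 * 1024                # random miss
assert stage_size(4 * 1024 * 1024) == MAX_STAGE  # long sequential stream
```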
Functionality
XIV supports advanced functions such as synchronous remote mirroring, thin provisioning, and writable snapshot technology. It supports up to 16,000 snapshots, which can be created in about 150 msec with a single click or command, regardless of volume size or system size. The “clones” can be read-only or read/write, and snapshots can be gathered into consistency groups and managed as a group. XIV uses redirect-on-write (RoW) rather than the copy-on-write (CoW) used in the original Iceberg from STK (aka RVA, SVA). RoW requires only two operations (read, write) per write, whereas CoW requires three (read, write, write). RoW therefore causes less write overhead, but snapshot deletion or automatic expiration carries a penalty, because the data at the snapshot location must be reconciled back into the original volume. The huge number of snapshots and their fast creation may be very useful for Continuous Data Protection (CDP) deployments.
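The difference is easiest to see as pseudo-I/O. The sketch below models the general CoW/RoW techniques with plain dictionaries, not XIV's internals; counting the map lookup as RoW's read operation is our interpretation of the two-operation figure above:

```python
def cow_write(volume, snapshot, addr, new_data):
    """Copy-on-write: three I/Os per host write."""
    old = volume[addr]          # 1: read the current block
    snapshot[addr] = old        # 2: write it aside to the snapshot area
    volume[addr] = new_data     # 3: overwrite the block in place
    return 3

def row_write(volume_map, store, addr, new_data, free_addr):
    """Redirect-on-write: two I/Os per host write. The old block stays
    put and keeps serving the snapshot; the volume map is redirected."""
    _ = volume_map[addr]        # 1: read the map entry for this address
    store[free_addr] = new_data # 2: write the new data to a fresh location
    volume_map[addr] = free_addr  # metadata update: map now points anew
    return 2

vol, snap = {7: "old"}, {}
assert cow_write(vol, snap, 7, "new") == 3 and snap[7] == "old"
vmap, store = {7: 100}, {100: "old"}
assert row_write(vmap, store, 7, "new", 101) == 2 and store[100] == "old"
```

The deletion penalty also falls out of this model: discarding a RoW snapshot means walking the map and reconciling redirected blocks back into the original volume, work that CoW never defers.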
Availability
XIV redundancy is designed at the system level, not the component level. The even load distribution across all drives improves the effective MTBF of the SATA drives. The three power units provide N+1 power with double conversion (AC to DC to AC), eliminating power spikes, and include batteries for 15 minutes of uptime. All the modules are connected by redundant Ethernet switches. The system continuously logs events and statistical data, and as a background operation it automatically performs scrubbing and drive monitoring. After a disk failure, the rebuild covers only allocated, written data (not the full HDD capacity). Because the data is spread across all modules, all modules participate in the rebuild, which decreases the time to approximately 30 minutes for a 1 TByte disk. Recovering a complete data module typically takes about 3-4 hours.
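A rough model shows why a distributed rebuild can finish in about half an hour; the per-module rebuild bandwidth below is purely an assumed figure for illustration:

```python
LOST_TB = 1.0          # chunk copies lost with a fully written 1 TB drive
PEERS = 14             # surviving modules reading/writing in parallel
REBUILD_MBPS = 40.0    # assumed background rebuild rate per module

lost_mb = LOST_TB * 1024 * 1024
minutes = lost_mb / (PEERS * REBUILD_MBPS) / 60
print(f"~{minutes:.0f} minutes")  # ~31 minutes under these assumptions
# Since only written data is rebuilt, a half-full drive halves this time.
```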
Storage Management
The key word here is simplicity! To create a new volume, the user defines a name and a size in GBytes or TBytes; there is no need to create RAID groups, decide on layout, or set configuration parameters. Management is done through a GUI or a CLI. When an “event” occurs, the box notifies the manager by SNMP, e-mail, or SMS. Because the data is uniformly distributed among all installed disks, there is no need for performance tuning.
What is missing to be a “real” enterprise system?
One of the definitions of enterprise high-end storage is mainframe support. XIV doesn’t support the mainframe Count Key Data (CKD) format and therefore cannot be included in this group. It supports synchronous remote copy via FC or IP, but it is not PPRC v.4 compatible, and its asynchronous technique is a “statement of direction” only. The current replication granularity is poor, and XIV is not yet supported by the IBM System Storage Productivity Center (SSPC). The 24 FC ports are significantly fewer than on any enterprise high-end system, but far more than on most mid-range systems.
Future Developments
Most of these points could be addressed in future versions, depending on how IBM plans to position the XIV within its storage portfolio. An interesting opportunity could be the integration of Diligent's ProtecTIER with XIV. This should not be a very complicated undertaking: Diligent runs on Linux, and both design centers are based in Tel Aviv, only a few miles apart. An integrated system, possibly with a smaller cache, could be an ideal backup and archiving solution for large enterprises.
Customer References
XIV sold 40 systems before being acquired by IBM: 35 in Israel and five in the United States. Most of the customers in Israel are tier-1 corporations, including one of the country's two largest banks and the leading telco. Since the acquisition, IBM has continued to sell the box under its own nomenclature, almost doubling the installed base. The first customer, Bank Leumi of Israel, currently has eight systems; this customer used (and still uses) high-end storage from EMC, HDS, and IBM.
Action Item: The XIV launch from IBM is potentially a very disruptive event, a fact which most storage vendors have ignored. The combination of a system built from standard industry components and SATA disks with IBM's purchasing power can be an explosive mixture, with the potential to cause global chaos in storage prices. It will allow IBM sales to offer prices that win every deal where price is the main criterion. XIV is, however, not a Tier 1 product; it is missing several functions, but despite that it has the potential to successfully replace many aging high-end systems. Its performance and connectivity place it between today's Tier 1 and popular mid-range (Tier 2) systems. Its price and management simplicity promise low CapEx and lower OpEx, which are in high demand in times of a weak global economy. IBM's umbrella of company viability, global services, sales, and flexible financing provides the infrastructure and customer security that XIV lacked as an independent company. It is worthwhile for every organization to evaluate XIV as part of its non-mainframe storage procurement.