Storage Peer Incite: Notes from Wikibon’s September 29, 2009 Research Meeting
When your data comes from outer space, you have unique challenges. At the Jet Propulsion Laboratory at Caltech, which manages images from the Spitzer Space Telescope and other orbital and terrestrial infrared telescopes, the problem is not the huge amount of data, although that runs to multiple petabytes. Nor is it the very high data transfer speeds. The basic problems are that, because of the nature of the data feed, all that data arrives as literally a billion very small files, and that everything has to be done on a very tight budget.
How Caltech solved these two problems was the focus of the Sept. 29 Peer Incite Meeting, which featured a presentation by Eugean Hacopians, Senior Systems Engineer at Caltech.
The file management problem, he said, turned the JPL environment into a tape eater. The lab's one experiment with restoring from tape took weeks and resulted in multiple tape and tape drive failures. As a result, while the lab still archives to tape, it uses mirrored storage to provide live backup of all data as it is captured and processed. It uses a fully componentized system, building standard storage "blocks" in the lab. If a component fails in one of those blocks, the mirror can immediately take over as primary while the component is replaced. Building its own storage blocks requires more work, but it allows the lab to make use of older, off-maintenance components where appropriate and allows it to build up a large supply of older spare parts that can be used to replace failing subsystems.
It keeps all its data on these mirrored systems with no off-site archiving. This is also mandated by the needs of the astronomers it serves. Unlike business data, astronomical images do not change quickly, and astronomers regularly use images made years ago. It does use MAID disks for less active data, shutting those disks down and parking the heads when they are not in use. This saves power and cooling and wear on the disk drives, and Hacopians reports no problem restarting those disks.
The lab deals with the cost issue in part by keeping its components beyond their three-year warranty period, assuming the risk of failure and moving the older components to less critical parts of the system and, sometimes, to research projects that lack adequate funding to purchase new storage. G. Berton Latamore
Managing billions of small files effectively requires a clear understanding of data flows and a system based on common Lego-like building blocks that provide services to application owners.
This was the message at the September 29th, 2009 Peer Incite Research Meeting, where an industry practitioner, Eugean Hacopians, Senior Systems Engineer at the California Institute of Technology (Caltech), addressed the Wikibon community.
Caltech is the academic home of NASA’s Jet Propulsion Laboratory. As such it runs the downlink for the Spitzer Space Telescope, NASA's orbital space telescope, as well as 13 other missions, processes the raw data into images, and supports the needs of scientists visiting from locations worldwide. The focus of this discussion was the activities of the Infrared Processing and Analysis Center (IPAC), which has evolved to become the national archive for infrared analysis from telescopic space missions.
To be sure, Caltech’s needs are on the edge. The organization is the steward for more than 2.3 petabytes of data created from its 14 currently active missions. Caltech captures data from these missions and performs intense analysis in what it calls its ‘Sandbox’, a server and storage infrastructure that supports scientific applications that analyze the data. Once ‘crunched,’ the data is moved to an archive, using homegrown data movement software.
Hacopians explained to Wikibon members that due to the nature of the downlink, the files managed by Caltech are small, ranging in size from 5-25 kilobytes. But there are a lot of them -- billions or even trillions. Caltech had previously attempted to use HSM software and tape but quickly realized the environment was not appropriate for tape libraries. Hacopians called it a ‘tape killer.’
The team at Caltech had to design a cost-effective means of providing reliable data access to all this scientific data. As well, organizationally, the projects supported by Caltech had to be completely walled from each other from an accounting standpoint. Rather than implement a shared SAN infrastructure with onerous chargeback mechanisms, Caltech decided to use a common set of technologies that would support each of the projects. The technological building blocks are:
- A Sun Solaris server running the ZFS file system,
- A QLogic 5602 FC switch,
- One to three Nexsan SATABeast arrays.
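The building block listed above can be pictured as a small, fixed record that is stamped out once per project. The following is only an illustrative sketch, not Caltech's actual tooling: the `StorageBlock` class, the `is_interchangeable` helper, and the project names are hypothetical, and the only facts taken from the discussion are the three component types and the one-to-three array limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageBlock:
    """One self-contained block dedicated to a single project (hypothetical model)."""

    project: str                                  # owning project, kept separate for accounting
    array_count: int = 1                          # one to three arrays per block
    server: str = "Sun Solaris server with ZFS"
    switch: str = "QLogic 5602 FC switch"
    array_model: str = "Nexsan SATABeast"

    def __post_init__(self) -> None:
        # Enforce the one-to-three array rule described in the text.
        if not 1 <= self.array_count <= 3:
            raise ValueError("a block holds one to three arrays")


def is_interchangeable(a: StorageBlock, b: StorageBlock) -> bool:
    """Two blocks can swap parts when they use the same component types."""
    return (a.server, a.switch, a.array_model) == (b.server, b.switch, b.array_model)


if __name__ == "__main__":
    block_a = StorageBlock(project="mission_a", array_count=3)
    block_b = StorageBlock(project="mission_b")
    print(is_interchangeable(block_a, block_b))
```

Because every block shares the same component types, growth means adding another block rather than resizing a shared pool, which is what keeps projects walled off from each other without a chargeback scheme.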
Caltech uses Nexsan’s AutoMAID spin-down capabilities in its archive to reduce energy costs, using Level 1 (slowing the spin speed of the disk) and Level 2 (parking the heads after sufficient inactivity). It does not put the drives into sleep mode (Level 3) and has never had reliability problems associated with spinning down devices.
Caltech uses SAIC tape for long term archiving and last resort off-site disaster recovery. However, its own tests indicate that because of the huge number of small files involved, recovery from tape would take weeks or longer.
This building block approach has allowed Caltech to use common configurations across its infrastructure. Caltech derives four main benefits from this strategy:
- The infrastructure is architected for fast, simple, safe recovery from failure or data loss.
- The approach scales nicely in support of Caltech’s data growth, which occurs in large chunks of hundreds of TBs and billions of files at a time.
- It streamlines staff training.
- The "Lego" building-block method allows Caltech to reuse infrastructure when it comes off maintenance, providing it with large numbers of spares and saving money.
Caltech uses a cascading refresh approach when new infrastructure is purchased, placing the newer equipment in support of the most critical parts of the infrastructure and migrating older equipment to less mission-critical areas. In this case, the archive is the most critical as it houses massive numbers of files that scientists access for their research and because it is regarded as a National Archive, which should be kept indefinitely. The Sandbox infrastructure is the least critical because data is quickly migrated off it into the archive.
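The cascading refresh described above can be sketched as a simple push-down through tiers ordered from most to least critical. This is a minimal illustration under stated assumptions, not Caltech's procedure: the `cascade_refresh` function and the unit names are hypothetical, and each tier is assumed to keep a fixed number of units, with anything pushed off the last tier becoming spare-parts inventory.

```python
from typing import List, Tuple


def cascade_refresh(
    tiers: List[List[str]], new_units: List[str]
) -> Tuple[List[List[str]], List[str]]:
    """Push new units into the most critical tier and cascade older ones down.

    tiers: tiers ordered most critical first; each tier lists units newest first.
    new_units: newly purchased units, newest first.
    Returns the refreshed tiers and the units displaced from the last tier.
    """
    incoming = list(new_units)
    refreshed: List[List[str]] = []
    for tier in tiers:
        capacity = len(tier)                  # assumption: tier size stays fixed
        combined = incoming + tier            # still ordered newest first
        refreshed.append(combined[:capacity])
        incoming = combined[capacity:]        # oldest units move down a tier
    return refreshed, incoming                # leftovers become spares


if __name__ == "__main__":
    archive = ["archive_2007", "archive_2005"]    # most critical
    sandbox = ["sandbox_2004", "sandbox_2003"]    # least critical
    tiers, spares = cascade_refresh([archive, sandbox], ["new_2009"])
    print(tiers)
    print(spares)
```

With one new unit, the oldest archive unit drops into the sandbox and the oldest sandbox unit leaves service as a source of spares, which matches the reuse of off-maintenance equipment noted earlier.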
Benefits of the Approach
The choice of building-block versus shared-SAN infrastructure is an interesting one. While it may appear more expensive, because in some cases Caltech may be over-provisioning resources to support an application, on balance the benefits outweigh the costs. Caltech has only three individuals looking after all this infrastructure and the system has been extremely reliable. Training costs are low because of the commonality across infrastructure and data integrity has been high. The organization has not lost big chunks of productivity due to data loss or complicated recoveries.
The Nexsan infrastructure is a good fit for Caltech for two primary reasons: Caltech’s applications are well-suited to using high capacity SATA arrays as part of its building block strategy, and Nexsan support has been very responsive, assisting Caltech in both architecting the building block (from a storage perspective) and rapidly solving problems. Caltech has avoided jumping on the fad du jour (e.g. object-based storage or Cloud Computing models), preferring rather to stick with a proven approach.
Action Item: The challenge of managing many billions or even trillions of small files presents issues above and beyond difficulties of managing capacity and growth. In this type of environment, IT organizations must understand the type of data, the rate of data change, and the flow of data before settling on an infrastructure and methodology to support applications. Taking a building-block approach, using common server, interconnect, and storage components will simplify installation, maintenance, and training, and support more facile, reliable recovery from data loss.
A discussion with the California Institute of Technology's (Caltech's) Infrared Processing and Analysis Center (IPAC) evokes memories of Carl Sagan. While Caltech has some unique data attributes that are not necessarily applicable to many CIOs, organizations facing enormous growth may be able to learn a few things from the Caltech case example.
At IPAC, it's not so much capacity growth that's the challenge, even though growth comes in large 100 TB chunks that present some non-trivial issues. Rather, it's the number of files (many billions and even trillions) that created the primary challenge for the organization: reliable recovery.
As such, Caltech architected its infrastructure to address recovery ahead of other requirements. This was not necessarily an obvious strategy based on the initial requirement to analyze and house massive amounts of telescopic mission data. But after thinking through the type of data and data flows, Caltech realized the main challenge was not so much how to scale for capacity but how quickly it could recover from a data access problem.
The nuance of managing many small files places unique requirements on IT, especially with respect to recovery. This is a main reason that Caltech chose not to go with a large centralized SAN and instead architected a series of mini storage nodes using a building block approach around Sun servers, ZFS, QLogic switches and Nexsan SATABeast arrays.
The lesson here is that as requirements are established, they go through many revisions. Often IT organizations (ITOs) are rightly concerned with scope creep, but it's imperative that ITOs don’t take the requirements as the Bible. Organizations need to decode requirements and think through the implications of architectural choices. After doing just this, Caltech chose to go with a Lego building-block approach. While this may not be appropriate for all use cases, the point is that for Caltech, this dramatically simplified recovery and reduced the risk of losing access to a centralized storage archive.
Action item: The number of files or volumes, more than amount of capacity, will often be the gating factor in terms of managing storage growth. Especially in exceedingly high growth environments, organizations must understand how data is created, processed, accessed and protected in order to truly meet business requirements. Taking a building block approach, where standardized server, switch and storage components are used, can simplify infrastructure and create commonality across applications which lowers risk and costs.
How does Caltech build its server and storage infrastructure for its very custom infrared astronomy research application with literally trillions (yes, trillions) of files? We heard from Eugean Hacopians, Senior Systems Engineer at the California Institute of Technology on the September 29, 2009 edition of the Wikibon Peer Incite. Eugean described his cookie-cutter approach to building out server and storage infrastructure for their very custom application.
In order for the very small IT staff to provide server and infrastructure resources for a large number of scientists and their accompanying huge amount of infrared astronomy research data, Caltech uses a standard configuration of servers and “SAN in a box” storage infrastructure. It builds and replicates each block of server and storage infrastructure using identical configurations to provide a consistent experience for the users and to standardize their own IT maintenance and training processes.
Caltech builds redundancy into its Sun Solaris servers, using the ZFS file system, QLogic Fibre Channel host bus adapters (HBAs), QLogic 5602 Fibre Channel switches and Nexsan Technologies SATABeast storage systems. The ZFS file system handles the huge number of files that are required, and the SATABeast storage handles the volume of data required. Each block of server and storage infrastructure in its server farm has redundant components and is designed for high availability and ease of component replacement in the relatively rare event of failure. As much as possible, the configurations are identical, including RAID configurations, storage connectivity, etc. In addition, it keeps some spare switches and other components on standby as needed.
Caltech also keeps the equipment for several years in order to maximize its investment of project funding. In keeping the equipment beyond the life of the project originally requiring the equipment, it deploys the older equipment for other projects with inadequate funding to support their storage needs. Because it keeps the equipment for long periods of time, Caltech has built up its own supply of spare parts, sometimes becoming a second source for the original supplier after the production run of particular models of equipment has ended.
Action item: Sometimes the best approach for a custom application is to provide simple, cookie-cutter server and storage configurations, so that the focus can remain on doing the work of the business and less on the IT infrastructure. In so doing, IT can maximize the use of its people, budgets, and equipment.
Trust, common interests, and shared values are the foundation of any relationship. In particular, decentralized and/or smaller organizations can benefit a great deal from developing and maintaining good vendor relationships.
In the September 29, 2009 Peer Incite, Caltech's Eugean Hacopians gave an example of how he has established an excellent relationship with storage supplier Nexsan Technologies. Caltech’s philosophy is that IT is responsible for understanding the user requirements and for architecting the solution. What they need from their storage supplier is detailed information on how the storage works and advice on how the architected solution could be improved. They actively resist adding risk to the solution by constantly deploying the latest IT technology. Of particular importance to Caltech is flexibility in this relationship; they recognized that the IT solution is going to change as the true nature of the overall project requirements become clearer. Caltech reduces risk by using a “Lego” building block approach to design and by using the same components wherever possible. Buying building blocks from one vendor and knowing they have the flexibility to move these blocks around works better than attempting to optimize a “one-off” solution for a particular problem.
Action item: There are as many different ways of setting up a good working relationship with vendors as there are of skinning a cat. One useful test of the quality of that relationship is whether your vendor's CTO or senior director (outside of sales) returns your call.
Repeating a theme that is echoed across the Wikibon user community, Caltech’s Eugean Hacopians, a senior systems engineer who supports more than 2.3 PB of data for the academic arm of NASA’s Jet Propulsion Laboratory (JPL), cautions technology buyers to be wary of vendor performance claims and chides the bulk of vendors who seem more intent on launching into sales pitches than on making the effort to understand his unique requirements. Hacopians had especially harsh words for several large storage vendors for their seeming lack of responsiveness to JPL’s needs and staunch adherence to their own marketing and technology agendas.
While JPL is only one small entity within the Caltech universe, as part of its charter to provide scientists with telescopically collected infrared images of the entire “visible” universe, it stores massive amounts of unstructured data that needs to be kept “forever”. This includes trillions of small images or objects at roughly 5 million per TB. Their archiving software is homegrown and, unlike most archival solutions, JPL cannot take advantage of de-duplication or single instancing functions to reduce storage growth, nor can they delete data over time.
With approximately 2,500 spinning disks to manage today and unlimited data growth for the future, along with a potentially high profile reference client in the offing, one might be led to believe storage vendors would be tripping over themselves to have JPL as a client. However, with tight budgets, Hacopians has to make do with less than bleeding-edge technologies as well as keeping an inventory of spare parts and allowing storage products to go off maintenance in a “use it until it dies” strategy.
On the other hand, as additional storage or technology refreshes are needed, JPL has deployed inexpensive SATA drives to lower cost per GB and to cut energy costs, and Hacopians has implemented Nexsan’s AutoMAID (massive array of idle disks) technology. Hacopians says that, in his experience, small vendors are more responsive to smaller accounts such as JPL, which spends less than $1 million per year on storage.
Consistent with Hacopians’ statements regarding vendor behavior, during recently held roundtable discussions facilitated by Wikibon, many users related their frustration with vendors, whom they experience as being more interested in selling than in understanding the buyer’s unique requirements. Continued commoditization resulting in lower cost per GB, improved reliability of storage components, and the availability of these components to a broader spectrum of vendors - including smaller players willing to innovate and build relationships with customers while providing superior customer service - will increase the pressure on the largest storage solutions providers, who are not meeting the needs of mid-market or small customers, and in some cases may even erode their market share in the biggest accounts.
Action item: Vendors need to ask good questions, show interest in helping customers solve business problems, and not launch into a product pitch before taking the time to understand a customer's requirements. This is especially important in edge environments. Don’t rush to propose a solution before gathering the type of information required to ensure a good fit. Take the time to understand the prospect and their environment and be prepared to walk away from a deal or recommend another solution when appropriate.
Dealing with a Billion file Cache and a Petabyte Library: Caltech only Retires Hardware When it Breaks
Capturing electronic satellite and telescope imagery creates a raw data cache with billions of small files occupying over a petabyte. Once that data is processed, it is put into a public archive library, also of petabyte scale. The archive library is the mission-critical part of the infrastructure. If it is lost, months or even years of work are lost.
When Caltech does a hardware refresh, new stuff goes to the more mission or operation critical archive library first, and the older storage is pushed down to the raw data cache in a cascade or waterfall scheme. This strategy generates more work, but it provides better reliability at the high end of the infrastructure.
For managing the raw data cache, a sandbox-like approach is used. Caltech’s sandboxes were built with Nexsan ATABeasts, each with about 400 gigabytes of drive capacity, and are now more than five years old. Caltech’s strategy is to never get rid of hardware until it dies. In Caltech’s experience, controllers and chassis don’t go bad -- only disk drives go bad, and these can easily be replaced. Caltech uses a spare parts approach. When equipment comes off maintenance, Caltech takes on the risk and inventories older arrays for spares.
Action item: Understand the data flow and refresh infrastructure intelligently, using a waterfall methodology to cascade older infrastructure to less mission critical parts of the application and point the newer gear toward the most important parts of the application. The downside is it’s more work this way, but by having commonality across the board, everything is interchangeable.