Summary
Tiered storage is an idea going back at least to 1990, when Gartner was talking about a three-tiered storage architecture as a way of increasing storage efficiency. With the explosion of data over the last 12-24 months, the idea has gained interest as a way of controlling costs and the runaway growth of disk farms. But in practice, active tiering has been confined to homogeneous storage environments, which in most shops means islands of storage usually involving only two tiers, because of the lack of automated tools to manage the data. Tiered architectures are becoming more common, but usually they are passive, with all the data for a particular application assigned to a specific tier either to meet the application’s technical requirements or in response to the political clout of the business owner of the application. That data then spends its entire life on that tier, until eventually, usually long after the data has ceased to be useful, it is deleted by the data manager. This is a far cry from the vision of the active tiered system, with the newest, most active data on Tier 1 or 2 and older data migrating to a lower tier of less expensive media as it ages and reads drop to a few a month. That, however, may be changing with the introduction of new technology promising automated data management across large, heterogeneous storage environments. With that in mind, this paper offers a multitiered, heterogeneous storage model with some technology suggestions based on two years of Wikibon research.
Why Tiered Storage
Tiered storage has been an attractive and popular theory since at least 1990, when Gartner analysts talked about a three-tier storage model, with data basically entering the model at Tier 1 and then moving down to Tier 2 and finally Tier 3 as the data ages and accesses drop to near zero. This of course is an oversimplified statement of the model, which also must take into account the needs of specific applications and whether the data will be needed for quarterly and year-end financial reports, among other things. But the insights behind it are still valid today. Multiple studies show that data access follows the 80/20 rule: the most recent 20% of the data attracts 80% of the access. This means that after a certain period (a few weeks in most cases) data has aged to the point that it is seldom read. Keeping this older data on expensive, high-performance media, when that level of performance is no longer required to support its use, wastes money. In the present environment, in which data growth is swamping data centers and driving huge expenditures on storage farm growth while companies face a severe recession, IT cannot afford this waste.
This problem is most acute with transactional data, such as that generated by ecommerce, which usually occupies the most expensive, fastest media. The extreme case is in financial trading, where the latest market data is extremely valuable, to the point that trading companies generally use expensive solid-state storage to capture it. But that value drops precipitously within minutes; yesterday’s data is only of interest for historical trend analysis, and week-old market data may have no value. In theory, moving older data to less expensive, lower-tier media can save on CapEx while improving performance at the top tiers by eliminating data “clutter”.
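To make the potential saving concrete, the arithmetic can be sketched as follows. This is only an illustration: the function name `storage_cost`, the 100 TB capacity, and the per-terabyte prices are hypothetical values chosen for the example, not figures from Wikibon research. The 20% "hot" fraction follows the 80/20 rule described above.

```python
def storage_cost(total_tb, hot_fraction, hot_cost_per_tb, cold_cost_per_tb):
    """Compare keeping all data on the top tier with splitting it across two tiers.

    Returns (single_tier_cost, tiered_cost). All inputs are illustrative;
    real prices vary widely by vendor and over time.
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    single_tier = total_tb * hot_cost_per_tb
    hot_tb = total_tb * hot_fraction      # recent, frequently read data
    cold_tb = total_tb - hot_tb           # aged, seldom read data
    tiered = hot_tb * hot_cost_per_tb + cold_tb * cold_cost_per_tb
    return single_tier, tiered


# Hypothetical numbers: 100 TB in total, 20% of it active,
# a top tier at 10,000 per TB and a lower tier at 2,000 per TB.
single, tiered = storage_cost(100, 0.20, 10_000, 2_000)
saving = single - tiered
print(f"single tier: {single:,.0f}  tiered: {tiered:,.0f}  saving: {saving:,.0f}")
```

The sketch ignores migration, management, and software costs, which is exactly where the practical obstacles to tiering arise.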
In practice, however, tiering has only been used in very limited, usually homogeneous storage environments due to four critical problems:
- Lack of automated data classification tools,
- Lack of automated policy management tools,
- The immaturity of storage virtualization and consequent lack of fully virtualized environments,
- Lack of support for heterogeneous storage environments.
As a result, while many data centers have a form of tiering, the tiers are not integrated in any meaningful way. Data sets are assigned a tier based partly on the needs of the application but often largely on the clout of whoever on the business side “owns” that application. Once written onto a particular drive, that data usually stays there virtually forever, certainly long after any real need for it has expired, until it is finally deleted and lingers only on backup tapes. Even as CIOs lament the exploding growth of disk farms, which are chewing up IT budgets and overcrowding data centers, they maintain disks full of data that no one has looked at for years, and often those are expensive Tier 1 and Tier 2 systems.
However, this situation is changing. Some suppliers have released tools that do a good job of automating policy management and virtualizing storage systems, while others do a good job of virtualizing and supporting data migration in heterogeneous environments. So far, however, strong automated data classification remains a hope for the future. With these changes in mind, and in the hope that an automated data classification tool will appear, Wikibon offers the following five-tier model for storage systems:
Why a five-tiered model?
At first glance, five tiers seems excessive. After all, the original Gartner model was only three tiers, and most homogeneous tiered data management systems today only use two. And the truth is that many organizations today do not need five tiers, because the extra tiers add complexity. Three or in some cases two will meet their needs for management of active data. Tier 4 was added to cover effective long-term archiving of inactive data such as old email files that companies are now maintaining largely in response to legal requirements and potentially for research into unofficial lines of communication and influence in their organizations. While many enterprises still depend on backup tapes to fulfill this need, this approach is not sufficient for several reasons, as companies have discovered to their detriment in a couple of high-profile cases.
Tier 0 reflects the specialized needs of the relatively few companies active in financial and other highly volatile trading markets, where very fast access to transactional data can mean the difference between making and losing large amounts of money. While most companies do not have this need, it is included to provide a complete model and because new high-speed disk technology may compete with solid state disk in this tier. Individual data managers can adapt the model to fit their organizations’ needs.
The Tiers
Tier 0: Business need: Extremely time sensitive, high value, volatile information needs to be captured, analyzed and presented at the highest possible speed. The primary example is currency trading. Note that this is a special-case situation not found in most business environments. Storage solution: Only storage with the highest, subsecond response speeds is good enough in the currency trading environment, where a single trade can make or lose more than the cost of the entire storage system. The usual solution is solid state storage, although new high speed disk technologies may compete in this space.
Tier 1: Business need: Transactional data requires fast, 100% accurate writes and reads either to support customers or meet the requirements of high-speed applications. One common example is online retail. Numerous surveys have shown that even relatively short delays in response to customer actions can result in lost sales, making high performance storage essential. Storage solution: Generally latest-generation, high-speed disk systems are used. These systems carry a premium price, but this cost is justified because slower performance systems would directly impact the business. However, even as disk becomes faster, solid state storage prices are decreasing and availability is increasing. As this trend continues solid state “drives” will find their way into the Tier 1 systems of increasing numbers of organizations.
Tier 2: Business need: This tier supports many major business applications from email to ERP. It must securely store the majority of active business data, where subsecond response is not a requirement but reasonably fast response still is needed. Email systems, which generate large amounts of data, are a prime example. While users can tolerate slightly slower response times than are required for transactional systems, they are quickly frustrated by consistently slow response. Storage solution: Tier 2 technology is always a balance between cost and performance. The latest entrant in this tier is XIV, now part of IBM, which offers large storage volumes and good-enough performance for Tier 2 at a very low price. The one catch is that to accomplish that, XIV systems come in two standard sizes. Multiple systems can be chained together to handle larger amounts of data, but the size minimum can lock out the lower end of the SMEs.
Tier 3: Business need: As data ages, reads drop off rapidly. However, that data often is still used for trend analysis and complex decision support. For instance, financial data needs to be kept accessible at least until the end of the fiscal/tax year. However, it does not need to stay on more expensive Tier 1 and Tier 2 systems. Similarly, emails more than a couple of weeks old are seldom accessed, but the business may still find it desirable to keep them on easily accessible systems. Storage solution: Tier 3 technologies can have two different characteristics. Much of the data in Tier 3 is really semi-active, and MAID technology is a good choice for that data. However, Tier 3 also holds data used for decision support analysis. Businesses that do a lot of complex analysis of historical business data might consider, as a Tier 3B solution for that data, a storage system designed to support complex queries, such as the Sybase IQ series of column-based systems.
Tier 4: Business need: Compliance requirements today are driving a tremendous explosion in storage for historical data. In the United States, for instance, state and federal civil courts often require companies to produce large amounts of historic emails, sometimes going back years, in civil torts. Often businesses rely on backup tapes to recover this data. However, backup tape procedures were never designed to preserve data going back several years. Tapes are lost, reused, or may have deteriorated. The technologies to read them may no longer be available. J.P. Morgan and other companies have learned to their cost, in some highly publicized court cases, the inadequacies of using backup tapes to archive old data. Backup tapes have another problem when used for archiving. They contain a snapshot of the entire corporate data population across all applications at a particular moment. This means that when an ITO is responding to a court request, for instance, for emails exchanged among specific people between specific dates that may span several years, it has to resurrect a large number of tapes containing extraneous data such as corporate financials and HR records, search them for individual emails, and reconstruct those emails into a chronological file.
Businesses need a better system for long-term storage of historical data. This is as much an issue of procedure as technology. Rather than backup tapes, they need tapes or other media containing archives of a specific data type, for instance all corporate email week by week. These need to be formally archived in an organized fashion with full records of exactly what is on each tape or other medium, where it is, what technology is needed to recover it, and when it needs to be migrated to a new physical medium to ensure that data integrity is maintained. Migration procedures need to be established to ensure that each tape is replaced before it reaches its end of life, and old tape technologies that are being replaced in the data center need to be preserved in the archive to read old tapes.
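The archive records described above could be kept in a simple catalog. The following is a minimal sketch, not a description of any particular archiving product: the `ArchiveMedium` class, its field names, the two helper functions, the 90-day lead time, and the sample entries are all hypothetical and exist only to illustrate the record keeping.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ArchiveMedium:
    """One tape (or other medium) holding a single data type for one period."""
    label: str
    data_type: str            # what is on the medium, e.g. "email"
    period_start: date
    period_end: date
    location: str             # where the medium is physically stored
    reader_technology: str    # drive or format needed to recover it
    migrate_by: date          # copy to a new medium before this date


def media_for_request(catalog, data_type, start, end):
    """Media of one data type whose period overlaps the requested date range."""
    return [m for m in catalog
            if m.data_type == data_type
            and m.period_start <= end
            and m.period_end >= start]


def due_for_migration(catalog, today, lead_days=90):
    """Media whose migration deadline falls within lead_days of today."""
    cutoff = today + timedelta(days=lead_days)
    return [m for m in catalog if m.migrate_by <= cutoff]


# Illustrative entries only; labels, locations and dates are made up.
catalog = [
    ArchiveMedium("EM-2007-W01", "email", date(2007, 1, 1), date(2007, 1, 7),
                  "vault A, shelf 3", "LTO-3", date(2012, 1, 1)),
    ArchiveMedium("EM-2007-W02", "email", date(2007, 1, 8), date(2007, 1, 14),
                  "vault A, shelf 3", "LTO-3", date(2012, 1, 8)),
    ArchiveMedium("FIN-2007-Q1", "financials", date(2007, 1, 1), date(2007, 3, 31),
                  "vault B, shelf 1", "LTO-3", date(2014, 4, 1)),
]

wanted = media_for_request(catalog, "email", date(2007, 1, 5), date(2007, 1, 9))
print([m.label for m in wanted])
```

Because each medium holds one data type for one period, a discovery request for email touches only the email media for the relevant weeks, not every full backup taken in that time.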
The solution: Tier 4 will contain very large amounts of data, but on the other hand no one expects this data to be instantly available. Courts, for instance, routinely give organizations two-to-four weeks to produce documents in discovery. For this reason, tape is by far the most cost-effective physical medium for much of this data. If the old data is needed either to respond to a court discovery or to support internal analysis, typically users can tolerate the time needed to mount the relevant archive tapes. Removable disks may also be considered. However, they tend to be more expensive and delicate than tape.
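The five tiers can be summarized as a placement policy. The sketch below is one possible reading of the model, not a definitive rule set: the `DataSet` fields, the `assign_tier` function, and the thresholds (21 days as "a few weeks", five reads a month as "a few", 365 days as the fiscal year) are assumptions chosen for illustration. In practice, producing these inputs automatically is the data classification problem noted earlier.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    age_days: int
    reads_per_month: int
    latency_critical: bool = False   # e.g. live market data for trading
    transactional: bool = False      # e.g. online retail orders


def assign_tier(ds, active_days=21, few_reads=5, retention_days=365):
    """Return a tier number from 0 to 4 for a data set (thresholds are assumptions)."""
    # Tier 0: the special case of extremely time-sensitive, brand-new data.
    if ds.latency_critical and ds.age_days < 1:
        return 0
    # Data stays on an active tier while it is young or still read often.
    still_active = ds.age_days <= active_days or ds.reads_per_month > few_reads
    if still_active:
        return 1 if ds.transactional else 2
    # Tier 3: semi-active data kept accessible, e.g. through the fiscal year.
    if ds.age_days <= retention_days:
        return 3
    # Tier 4: long-term archive of inactive data.
    return 4


samples = [
    DataSet("market-ticks", 0, 10_000, latency_critical=True),
    DataSet("web-orders", 3, 500, transactional=True),
    DataSet("email-current", 5, 50),
    DataSet("email-last-quarter", 90, 2),
    DataSet("email-old", 1200, 0),
]
for ds in samples:
    print(ds.name, "-> Tier", assign_tier(ds))
```

An organization that needs only three tiers could collapse the mapping, for example by treating Tier 0 as Tier 1 and Tier 3 as Tier 2.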
Conclusion
A multitiered storage system with automated data movement provides the best solution for managing the data explosion IT is experiencing. While not all companies need five tiers, they do all need at least three, including a data archive. Organizations that do not archive their data take a large risk if they are ever involved in a civil or criminal action and are required to produce historic documents. Overall, active tiering provides the best solution for supporting service level agreements at minimal cost and with the highest efficiency.