While budget growth is lethargic at best, demand for on-line data storage continues to grow, and power consumption issues are becoming visible at senior executive levels. The status quo will not remain acceptable, hence the need to re-tool the thinking surrounding the management and storage of data.
Data classification enables storage administrators to understand the expectations associated with the corporate data under their stewardship. Inactive data can be migrated to lower, less expensive tiers, freeing expensive tier 1 storage to deliver the performance required by highly dynamic, business-critical applications. In short, user expectations for data performance need to be matched to the cost of storage.
In most organizations, data is not growing; it is exploding! The problem with data is that once it is created, it is rarely eliminated, whether due to choice (corporate governance), indolence (poor data management practices), or compliance (government legislation). Thus the problems of managing data growth are only going to become more acute, particularly if current storage practices continue.
All data, whether transactional or persistent (fixed content), may share the same basic binary character, but it does not share the same usage patterns, particularly in today’s unstructured, content-rich, image- and video-based Web 2.0 world. The criticality of content and its usage pattern establish the primary characteristics that differentiate data and enable useful storage classification. Different data carries different performance expectations, with acceptable access latencies ranging from milliseconds to seconds or even days (how about never?), depending on the data in question. Data characteristics are defined by access requirements, retention time, integrity requirements, content, and technology longevity, all variables that affect how data should be managed, stored, or deleted, and that drive the choice of storage technology.
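To make these characteristics concrete, the sketch below captures them as a simple record. It is a minimal illustration only; the field names, types, and example values are assumptions chosen for the example, not a standard classification schema.

```python
from dataclasses import dataclass

@dataclass
class DataProfile:
    """Characteristics used above to differentiate data (illustrative fields only)."""
    access_latency_target: str          # e.g. "milliseconds", "seconds", "days"
    retention_years: float              # how long the data must be kept
    integrity_requirement: str          # e.g. "checksummed", "WORM", "best effort"
    content_type: str                   # e.g. "transaction record", "image", "video"
    technology_longevity_years: float   # expected service life of the hosting technology

# Example profiles: an order-entry table vs. a compliance-driven video archive.
order_table   = DataProfile("milliseconds", 1.0, "checksummed", "transaction record", 5.0)
video_archive = DataProfile("days", 7.0, "WORM", "video", 10.0)
```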
So why should data characteristics drive storage selection? Simply put, knowingly placing inactive data on a tier 1 platform is just as daft as expecting tier 3 or 4 storage to satisfy your service-level obligations for business-critical, transactional data. Neither scenario illustrates good stewardship of corporate assets, whether digital or physical. While the second scenario is unlikely, inactive data is, unfortunately, consuming significant tier 1 resources in many data centers. According to industry analysts such as ESG and the Taneja Group, an estimated 70% to 80% of data within the data center is inactive or persistent. Interestingly, in a recent survey of storage professionals conducted by COPAN Systems, 55% of respondents thought that this number was less than 50%, and 20% did not know how much persistent data they had, never mind how to manage it. Of those who were aware of their persistent data, 31% used primary storage and 40% used a combination of archive and primary storage to store and manage it. Yes, the persistent data was consuming expensive storage, but it was safe and available if needed, and, by the way, the IT manager was not accountable for the utility cost of these gluttonous, power-hungry storage frames. That, however, is changing as new financial realities impact budgeting and corporate spending. More prudent management of storage resources is required.
Data makes up the digital history of an enterprise. It is a corporate asset (and a liability) that holds considerable value (and risk). However, the value of data is variable and tends to be time and activity dependent. Value can be influenced by the age of the data and by access requests, whether internal or external; and, while not characteristics per se, its findability, its time to access, and its integrity are all critical influencers. That being said, a great deal of worthless data is being stored and managed in many an enterprise, at some measurable expense. Hence the need for robust data classification and management practices.
Data types can be simply classified as either highly active, transactional data or inactive, primarily historical/reference data, aka persistent data (a minimal classification sketch follows the list):
- Transactional Data – This is the traditional view of data and has molded today’s disk storage architectures. It is being captured or created, is highly dynamic, drives high IOPS, is random in nature, and tends to have a short shelf life. This is why traditional, transactional designs follow a read-write-modify-access model: they are optimized to provide access to data at all times, they assume data is cacheable or demonstrates temporal and spatial locality, and they are optimized for small-grain data access.
- Persistent Data – This data is rarely accessed or modified. It does not demand the same response time, has low IOPS demand, and tends to have low temporal access locality, meaning caching is a wasted expense. Persistent data tends to have a long-term retention requirement, is bandwidth centric, has data integrity concerns, and is likely to be event-driven, immutable, reference content; it is the fastest growing segment of today’s digital information. As referenced earlier, 70% to 80% of the data in a data center fits this description.
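The split between the two types can be illustrated with a very coarse rule of thumb. In the sketch below, the activity thresholds and tier labels are assumptions made for the example; real classification policies would use richer metadata and organization-specific service levels.

```python
# Coarse, illustrative classification: recently touched, IOPS-heavy data is treated
# as transactional; everything else is treated as persistent and pushed down-tier.
# The thresholds (30 days, 100 IOPS) are assumptions, not industry standards.

def classify(days_since_last_access: int, avg_iops: float) -> str:
    """Split a dataset into 'transactional' or 'persistent' from observed activity."""
    if days_since_last_access <= 30 and avg_iops >= 100:
        return "transactional"
    return "persistent"

def suggest_tier(data_class: str) -> str:
    """Echo the article's tier split: tier 1 for transactional, tier 3/4 for persistent."""
    if data_class == "transactional":
        return "tier 1 (high-performance primary storage)"
    return "tier 3/4 (dense, power-managed or archive storage)"

for name, idle_days, iops in [("order_db", 1, 2500.0), ("2006_video_archive", 900, 0.1)]:
    cls = classify(idle_days, iops)
    print(f"{name}: {cls} -> {suggest_tier(cls)}")
```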
Differentiating data types does not enhance or diminish the relative importance or value of data; it simply improves the chances of cost-efficient storage and of effective management, availability, and use. There is agreement that persistent data is the fastest growing data type in terms of data volume in the data center. The reason is that much of this data is subject to minimum retention periods dictated by one compliance regulation or another. Not only must this data be retained, but when requested it must be available in a timely manner. There is a history of significant financial penalties levied on companies that failed to deliver data in the time required by law.
Appreciating that different data has different value, and being willing to execute on even the simplest of classifications, enables the astute data storage manager to effectively match the value and requirements of their data to the cost and performance of the “hosting” storage technology.
This realization is the epiphany that opens the door to significant cost savings while surviving today’s phenomenon of explosive data growth.
Action Item: By understanding their data, storage administrators can develop storage strategies that match the value of the data to the cost of the storage platform. This action can potentially delay costly CAPEX decisions and minimize OPEX.
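As a rough illustration of what is at stake, the sketch below applies the 70% to 80% persistent-data estimate cited above to a hypothetical 100 TB environment. The capacity and per-TB costs are assumptions for the arithmetic only, not quoted prices.

```python
# Back-of-the-envelope savings estimate under assumed, illustrative costs;
# actual fully loaded $/TB varies widely by vendor, generation, and power rates.

total_tb            = 100        # managed capacity in TB (assumption)
persistent_fraction = 0.75       # midpoint of the 70% to 80% estimate cited above
tier1_cost_per_tb   = 15_000.0   # illustrative fully loaded cost of tier 1, $/TB
tier34_cost_per_tb  = 3_000.0    # illustrative fully loaded cost of tier 3/4, $/TB

persistent_tb = total_tb * persistent_fraction
savings = persistent_tb * (tier1_cost_per_tb - tier34_cost_per_tb)
print(f"Moving {persistent_tb:.0f} TB of persistent data off tier 1 "
      f"avoids roughly ${savings:,.0f} in storage cost.")
# Prints: Moving 75 TB of persistent data off tier 1 avoids roughly $900,000 in storage cost.
```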