Storage Peer Incite: Notes from Wikibon’s March 27, 2007 Research Meeting
Dave Vellante presents "Data classification: Brains or brawn?" New business value drivers necessitate a break from historical methods of classifying data. Auto-classification is a key prerequisite that must be designed into data architectures early in the process.
Data classification: Brains or brawn?
Dave Vellante and David Floyer
The current state of data classification is largely a byproduct of historical, hierarchical storage management (HSM) implementations in which data age is the primary classification criterion. Early visions of classifying data based on business value never fully came to fruition because they required a manual, brute-force approach and were too hard to automate. Age-based classification allowed automation processes to be applied more easily to data classification initiatives and became the de facto standard.
A new emphasis on compliance, discovery, archiving and provenance substantially challenges existing data classification taxonomies. New business value drivers include 'never delete' retention policies as well as performance, availability and recovery attributes, which are the underpinning of resurgent data classification efforts. While age-based schemata still predominate, they must incorporate richer classification attributes far more aggressively. However, this extension should be accomplished with an eye toward automation, where data set meta-data is auto-classified upon creation and/or use of the data set. Future data classification efforts will involve much broader perspectives and serve as the mainspring of multiple enterprise initiatives, including ILM, tiered storage, email archiving, decision support, data mining, electronic content management and compliance. In short, data classification will serve as the foundation for information value management. While manual development of business categories will always be necessary, without auto-classification these efforts have no chance of success.
Action Item: IT organizations must break with the past and make business process, not the age of data sets, the defining catalyst for classification schema. This approach will not scale without auto-classification capabilities that assign meta-data to data sets at the point of creation or use. Emerging tagging methodologies borrowed from social networking may provide a complementary user-driven approach, but these will not suffice for compliance and legal requirements.
Data classification value transcends storage efficiencies
Traditional drivers of data classification from a storage point of view have been to improve efficiencies, map data and device characteristics, and better serve application users. More than ever, with compliance, legal discovery and audit initiatives influencing corporate agendas, a new value proposition is emerging in which classification can enable the reconstruction of a continuum of organizational activities performed and decisions made over a period of time.
What this means is that the traditional reliance on a 'corporate memory' to piece together a series of events, or on a cumbersome discovery exercise, has the potential to be supplanted by a much more reliable and auditable system of infrastructure, meta-data, applications and business processes. To be sure, the justification, internal arm-twisting and development of this capability will not be trivial; however, the technologies, regulatory imperatives and competitive pressures are coming together in a sort of perfect storm that will dictate investment in this area for the next several years. At the heart of this opportunity is the automatic creation of classification meta-data and the enticement of users to provide meaningful input into the process.
Action Item: IT must sell the vision of how automating meta-data creation will drive huge improvements in productivity and facilitate the exploitation of untapped corporate knowledge. Application owners must be persuaded to develop meta-data creation functions and supporting architectures. Finally, meta-data creation must be simplified in order for end users to participate in the process and add incremental value.
Data classification: So much more than storage optimization
Storage executives have traditionally been responsible for data classification implementations. Data classification is a fundamental building block for effective ILM and archiving initiatives, and the potential benefits to the organization go far beyond storage optimization. However, out-of-scope organizational requirements can disrupt the initial objectives of data classification projects, and managers must guard carefully against scope creep.
In order for IT to implement a full data classification architecture, detailed assessments will be needed with legal, audit, risk management and business-line stakeholders; with architects regarding metadata architecture; with application developers and owners to determine metadata automation requirements; and with operations professionals. In the meantime, storage executives need to limit the scope of any data classification project to what can be achieved in the immediate term.
Action Item: Executives responsible for storage must keep data classification schema simple and limited to data that is system generated (e.g., date of creation and last use). While expanding the scope of classification efforts will eventually be necessary, it should not proceed until data classification schema are defined and automated methods of generation are in place. Relying on any manual entry of classification information will doom data classification projects to failure.
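To make the action item concrete, the following is a minimal sketch of classification driven purely by system-generated attributes, using the file system's last-access and last-modification times. The tier names and day thresholds are hypothetical, not part of the original research note; real policies would come from the schema-definition work described above.

```python
import os
import time

# Hypothetical tier thresholds (days since last use); actual policies will differ.
TIERS = [(30, "tier-1"), (180, "tier-2"), (float("inf"), "tier-3")]

def classify_by_age(path, now=None):
    """Assign a storage tier using only system-generated attributes:
    last-access and last-modification times reported by the file system."""
    st = os.stat(path)
    now = now if now is not None else time.time()
    last_use = max(st.st_atime, st.st_mtime)   # most recent touch of the data set
    age_days = (now - last_use) / 86400.0
    for threshold, tier in TIERS:
        if age_days <= threshold:
            return tier, round(age_days, 1)
```

Because the inputs are generated by the operating system rather than entered by hand, a scan like this can run unattended across an entire file share, which is exactly the property that makes age-based schemes automatable.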
Data classification: Managing metadata
Metadata is data about data, and enables data management. Provenance and respect for order are guiding principles for data management. Metadata includes when data was created, who and/or what created it, where the data was used, and when it was destroyed. We need to be confident that the data was not changed without record. Metadata is a key enabler for data classification.
Applications and users create data, and should create the metadata at the time of creation or use. Metadata is additive in nature, and does not need a single point of control. Operating systems, applications, system management software, databases, storage management software and storage hardware are all important contributors to the creation and storage of metadata. The creation of metadata has to be automated for applications, and made as simple as possible for end-users.
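The additive, multi-contributor model described above can be sketched as an append-only trail of metadata records. The field names and class names here are illustrative assumptions, not an agreed schema; the point is that any layer (application, OS, storage software) appends its own record, nothing is overwritten, and order is preserved.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MetadataRecord:
    """One contribution to a data set's metadata trail.
    Field names are illustrative; real types and layouts must be agreed."""
    event: str    # e.g., "created", "used", "destroyed"
    actor: str    # who or what performed the event
    source: str   # contributing layer: application, OS, storage software, ...
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

class MetadataTrail:
    """Additive metadata: any contributor may append, nothing is changed
    without record, and there is no single point of control."""
    def __init__(self):
        self._records = []

    def append(self, record: MetadataRecord):
        self._records.append(record)

    def history(self):
        return list(self._records)
```

For example, an application appends a "created" record at data set creation, and backup software later appends a "used" record; replaying the trail reconstructs provenance without consulting any one authoritative system.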
Action Item: The key imperative for enabling data classification is automating the creation of metadata. The first and most important step is to agree on metadata types and on the layout and structure of each type of metadata.
Auto-classification of metadata means truckloads of terabytes
Automatic data classification at the time of data set creation should have vendors salivating, because the amount of storage created will easily be twice the amount of core information captured (consider all the meta-data associated with a bounced email). Perhaps more importantly, auto-classification will remove a barrier to projects related to ILM, tiered storage, electronic content management, email archiving, etc. Savvy buyers will not let the added storage expense disrupt progress, but in order to capitalize, storage vendors must enable a new class of application that exploits classification meta-data. This means suppliers must re-tool technology portfolios to provide solutions that perform functions such as the following:
- Organize and classify meta-data
- Enable auto-classification
- Exploit meta-data using file system, index and search functionality
- Accommodate classification meta-data tables in high speed cache
- Provide high performance data movement
- Enable asset discovery and meta-data analysis
- Secure and encrypt meta-data
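As a sketch of the third bullet, exploiting meta-data through index and search, the following builds a tiny inverted index over classification tags. The tag vocabulary shown is invented for illustration; a real product would index the agreed metadata types against the file system or content repository.

```python
from collections import defaultdict

class MetadataIndex:
    """Minimal inverted index over classification metadata: maps each tag
    to the set of data sets carrying it, so discovery queries (e.g., for
    a compliance hold) never need to scan the underlying data."""
    def __init__(self):
        self._index = defaultdict(set)

    def add(self, dataset: str, tags):
        for tag in tags:
            self._index[tag].add(dataset)

    def search(self, *tags):
        """Return the data sets carrying all of the given tags."""
        sets = [self._index.get(t, set()) for t in tags]
        return set.intersection(*sets) if sets else set()
```

Queries intersect tag sets rather than reading the data itself, which is why classification meta-data tables held in high-speed cache (another bullet above) pay off for discovery and analysis workloads.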
Action Item: Key strategic initiatives supporting corporate and regulatory mandates will not go unfunded. Vendors that address the growing problems caused by the lack of solutions to automatically create classification metadata will reap the greatest rewards. Solutions must be ecosystem friendly, with published entry and exit points into key technology components that entice and facilitate partnerships.
How much meta in the data?
As storage teams expand use of data-driven storage technologies (e.g., virtualization, tiered storage), pressures to formulate comprehensive meta data strategies at the level of storage increase dramatically. Moreover, meta data from technologies like storage virtualization, which intervene between applications and previously dedicated pools of storage, may even supersede many classes of application-level metadata. Traditionally, storage administrators have created modest amounts of meta data, usually dictated by device types or formats. However, piecemeal approaches to creating, manipulating, and using meta data will not work where enormous volumes of data have to be logically integrated for storage automation to operate. The good news is that conventions, methods, and tools for managing meta data are very mature. The bad news is storage professionals typically know nothing more about them than how much storage they require.
Action Item: Storage professionals must begin formulating realistic storage meta data strategies that work in concert with enterprise meta data approaches, borrowing knowledge, tools, and methods to rapidly prepare for adoption of emerging data-driven automation technologies.