Originating Author: Robert Levine
Arranging and configuring storage for unstructured data such as emails, graphics, audio, and video is an increasingly important part of any technology strategy, as such content makes up at least 80% of an organization's data. This note is designed to address the following key issues:
- What are the business drivers and advantages of implementing a storage management scheme for unstructured content?
- How does one approach analyzing and scoping out such an initiative?
- What is the business impact of unstructured content solutions on the organization, and why does it need to be managed as tightly as structured content like databases?
- What are the risks and issues involved in implementing such solutions?
The four components of a content management system are content, a content repository, a user interface, and a database / storage management system. Content is any kind of file that someone can create on a computer, typically things like MS Office documents, PDF files, and image files, but extending to sophisticated interrelated sets of files such as parts of a website or XML documents. A content repository is a server environment where all of these files are stored. A user interface is a window-like environment through which a user can reach and work with the files. A database management system (DBMS) keeps track of information represented in these files, allowing users access to this information via different methods.
“Structured” content is data that is easily machine-readable and is usually represented in databases as records, rows, and columns, often stored in normalized form. “Unstructured” content does not lend itself easily to standard database storage and retrieval methods, and includes audio, video, text documents, graphics, email messages, groupware files, and instant messages stored in multiple business applications and multiple data management technologies, often as binary large objects (BLOBs).
It is commonly understood that at least 80% of an organization’s data is unstructured, and that this percentage is likely to grow, particularly for emails and instant messages. Gartner says this unstructured content will grow from nearly 4 million terabytes today to as much as 15.2 million terabytes in 2009. Gartner also estimates that white collar workers can spend as much as 40% of their time managing documents.
Unstructured content storage capability
With 80% or more of a company’s data in files of different types, it is not uncommon for large corporations to have terabytes of unstructured data stored on their file servers, PCs, and laptops. Often this data is as critical to business operations as the structured operational or transactional data found in databases: a hospital may store x-rays and other medical images as graphics files, an insurance company keeps accident reports and photographs as image files, a publisher keeps manuscripts in PDF format, and an investment bank retains broker phone records, emails, and instant messages in audio and various messaging formats. Yet the ability of end users to electronically search for just the right information they need (such as all of Mr. Smith’s x-rays from 1998 to 1999) is rarely satisfactory. Moreover, as such data continues to grow, IT departments struggle to keep up with the necessary storage requirements. This has led to the development of a market for unstructured content storage solutions.
Specific operational goals of implementing unstructured content storage solutions
There are specific goals associated with implementing unstructured content storage solutions:
- It becomes easier for users to find the information they are looking for; the goal is that it becomes as easy to search unstructured content for text, pictures, audio, video, and so on as it is to query a database using structured query language or similar end user tools. This can facilitate business processes that depend upon knowledge sharing and management.
- The flip side of this argument is that where users cannot easily find information, time and money are wasted in reproducing and duplicating content that is already stored. Further cost savings can be realized by electronically managing unstructured content (printing costs, for example, can be reduced by maintaining filings, policies, newsletters, press releases, product information, etc. online).
- Regulators who require information to be archived and managed are not making any exceptions for unstructured information. For example, regulated broker-dealers must save and archive email, instant messages, and dealer telephone records for later follow-up or audit.
- Operational and transactional data is usually backed up and retained, but employee knowledge represented by unstructured content can be irretrievably lost when the employee leaves.
- More organized content can improve customer and business partner relationships and make employees more informed and efficient by making key information more available.
- From a technology management standpoint, enterprise management of unstructured data can save money, impose tighter version control, allow better security of the underlying information, and allow integration with developers’ tools and workflow environments.
Risks of implementing unstructured content storage solutions
First, the business case for an unstructured content storage solution must be clear in order to avoid money ill-spent or business requirements unmet. This includes prioritization of the most important types of content to be addressed, definition of the scope and key requirements of the initiative, and consideration of a method for classifying unstructured data. Naturally, this also involves staffing a project team and preparing to invest considerable time in classifying unstructured data, configuring automated solutions, testing that content searches return relevant results, designing sensible archive policies, working with end users, and supporting the system once live. Poor planning and resourcing are the prime causes of implementation failure regardless of technology. In a related manner, it is important not to over-promise results; with so much unstructured content to manage, most analysts recommend phasing implementation and / or starting with a proof of concept. Finally, as with any new technology, unstructured content storage solutions are still relatively immature, and the possible range of implementation and support issues is still not fully clear.
The unstructured content storage solutions initiative
The business driver to initiate unstructured content storage solutions may originate from end users demanding better searchability and access to content, from an information technology department looking to better control costs and manage the exploding storage requirements resulting from the growth of such data, or from auditors and regulators requiring clear data security, backup, archival, retention, and retrieval policies and capabilities that extend from traditional databases to unstructured content management systems. Implementing an unstructured content storage solution proceeds through three phases: needs analysis, system design, and deployment / monitoring.
Much unstructured content – knowledge about products, services, processes, and people – is inside key people’s heads. Developing a system to capture and institutionalize this knowledge is key to managing this content – but outside the scope of an unstructured content storage solution implementation. Also, it is assumed that the organization’s core storage architecture is already in place; designing such an architecture is out of scope of this discussion.
While IT storage managers are most familiar with structured content, individual content managers are more familiar with unstructured content. So-called content authors use authoring or traditional desktop tools to create and author documents, spreadsheets, presentations, images, etc. Content consumers (who may or may not be the same as the authors) use desktop tools or a web browser to retrieve content.
The analysis phase begins with identifying the different types of content authors and consumers and understanding how they create, use, archive, and retrieve this information. This is challenging because unstructured information can be created and stored in disparate physical locations and file formats across the organization - file servers, email servers, imaging servers, individual desktops, laptops, and web servers, for example. Inventorying content involves gaining access to these environments.
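As a starting point for the inventory, a short script can total the number of files and bytes per content type on one file share. This is a minimal sketch, not a vendor tool: the `CONTENT_TYPES` mapping and the `inventory` function are hypothetical names introduced here for illustration, and the extension list is an assumption that should be replaced with the formats actually in use.

```python
import os
from collections import defaultdict

# Hypothetical mapping from file extension to content type (an assumption;
# replace with the formats found in your own environment).
CONTENT_TYPES = {
    ".doc": "document", ".docx": "document", ".pdf": "document", ".txt": "document",
    ".xls": "spreadsheet", ".xlsx": "spreadsheet",
    ".ppt": "presentation", ".pptx": "presentation",
    ".jpg": "image", ".png": "image", ".tif": "image",
    ".wav": "audio", ".mp3": "audio",
    ".avi": "video", ".mp4": "video",
    ".msg": "email", ".eml": "email", ".pst": "email",
}


def inventory(root):
    """Walk a directory tree and total file count and bytes per content type."""
    totals = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            kind = CONTENT_TYPES.get(ext, "other")
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files that vanish or cannot be read
            totals[kind]["files"] += 1
            totals[kind]["bytes"] += size
    return dict(totals)
```

Running `inventory` against each file server, desktop share, or web root in turn gives a rough picture of which content types dominate storage. Email and imaging servers usually need their own export or reporting tools, since their content is not always visible as ordinary files.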
Next, content should be classified (categorized, indexed, prioritized) to the extent possible. This also involves assessing the criticality and security requirements for this content. Some content will need a review or approval cycle, strict version management, and audit trail management; other content will not. Finally, it is important to identify the backup and archiving requirements (if any) of the different types of content. The Analyze Phase can take a few to several weeks, depending upon the ease of access to the content just described.
Acceptance test considerations
The Analyze Phase is complete when unstructured content types have been identified, classified, subjected to security and controls assessment, and analyzed with respect to backup and archival requirements. This analysis should permit the scoping of an effort to arrange or enhance storage for such unstructured content.
Key analysis milestones
Milestones in the Analysis Phase typically include the following:
- Unstructured content types identified.
- Ownership of unstructured data clear.
- Unstructured data is classified according to a “taxonomy” (hierarchical structure).
- Security requirements (access control lists, privacy requirements, etc.) over unstructured data defined.
- Control requirements (review / approval cycles, version control, and audit trails) defined.
- Backup policies set (how often).
- Archival policies set (how often, to be retained for how long, need for easy retrieval from archive).
- Migration policies determined (how often, from which storage tier to which tier).
- Data mining or text analytics requirements set.
- Need for integration with enterprise applications determined.
- Scoping for unstructured content storage solution is finalized.
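The outputs of these milestones can be recorded as one policy record per node of the taxonomy. The following is a minimal sketch under stated assumptions: `ContentPolicy`, `find_policy`, the field names, and the sample taxonomy paths are all hypothetical, and the rule that content inherits the policy of its nearest classified parent node is a design assumption rather than something prescribed by any product.

```python
from dataclasses import dataclass, field


@dataclass
class ContentPolicy:
    """Hypothetical record of the analysis outputs for one taxonomy node."""
    taxonomy_path: str                 # e.g. "clinical/imaging"
    owner: str                         # who is accountable for this content
    allowed_groups: list = field(default_factory=list)  # access control
    needs_approval: bool = False       # review / approval cycle required
    versioned: bool = False            # strict version control required
    backup_every_days: int = 1         # backup frequency
    retain_years: int = 0              # 0 means no archival requirement


def find_policy(policies, path):
    """Return the policy of the most specific taxonomy node covering path.

    Assumption: content inherits the policy of its nearest classified
    ancestor in the hierarchy; None means the content is still unclassified.
    """
    parts = path.split("/")
    while parts:
        candidate = "/".join(parts)
        if candidate in policies:
            return policies[candidate]
        parts.pop()
    return None


# Illustrative entries only; owners, groups, and retention periods are made up.
policies = {
    p.taxonomy_path: p
    for p in [
        ContentPolicy("clinical", owner="records office", retain_years=7),
        ContentPolicy("clinical/imaging", owner="radiology",
                      allowed_groups=["radiologists"], versioned=True,
                      retain_years=10),
    ]
}
```

With these entries, `find_policy(policies, "clinical/imaging/x-ray")` resolves to the radiology policy, while `find_policy(policies, "marketing/brochures")` returns `None`, flagging content whose ownership and classification are still open.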
Before evaluating hardware and software solutions, it is advisable to investigate standardization, which can reduce the time, complexity, and cost of a solutions implementation by at least 30%. Is it possible to standardize your unstructured content around certain file types? In other words, is it necessary to have multiple file formats for audio files, for images, for documents, for spreadsheets, and (especially) for email?
There are data management standards but these are still adapting to unstructured content: ICE (Information and Content Exchange), WfMC (Workflow Management Coalition), and WebDAV (Web Distributed Authoring and Versioning) are three examples. IBM’s UIMA framework is an attempt to develop a standard for managing and processing unstructured content. In the absence of a widely-accepted standard, XML has become a de facto standard for content creation, representation, indexing, and presentation. At any rate it is important to evaluate any vendor offerings for compliance with these and emerging standards for unstructured content storage.
At this point the various requirements and standards can be put together into a design document and an RFI or RFP, and vendors can be contacted for a solutions evaluation, following your organization’s processes. Don't forget to factor in growth in any storage estimates! When evaluating solutions, it is important to reflect your users’ requirements in the pilot solution as parameters governing storage limits, backup policies, migration and archival policies, and so on. Some solutions will automatically index certain types of unstructured content (make sure yours are included) as they back up data. But the ultimate proof of concept of any software or hardware solution is when users can better find, classify, sort, report from, and retrieve information stored or archived in unstructured format.
With well-defined requirements, the design stage can be finished in a week or two.
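Factoring growth into a storage estimate can be as simple as compounding an assumed annual growth rate over the planning horizon. The sketch below is illustrative only: `projected_storage_tb` is a hypothetical helper, and the 10 TB starting point and 40% growth rate are made-up inputs, not figures from this note.

```python
def projected_storage_tb(current_tb, annual_growth_rate, years):
    """Compound the current storage footprint by a fixed annual growth rate."""
    if current_tb < 0 or years < 0:
        raise ValueError("current_tb and years must be non-negative")
    return current_tb * (1 + annual_growth_rate) ** years


# Hypothetical inputs: 10 TB today, growing 40% per year.
for year in range(4):
    estimate = projected_storage_tb(10.0, 0.40, year)
    print(f"year {year}: {estimate:.1f} TB")
```

Real growth is rarely uniform across content types, so it may be more useful to run the projection separately for each type identified in the analysis (email, images, documents) and sum the results.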
Acceptance test considerations
The Design Phase is complete when service levels and roles and responsibilities are defined and agreed between customer and provider for each service in scope. Each service level metric must be measurable somehow in order for it to be meaningfully monitored post-deployment. In addition, key contact information should be written into the SLA, as should any arrangements for cost chargebacks or financial penalties.
Key design milestones
Milestones in the Design Phase include the following:
- Standardization and standards compliance investigated and implemented where possible.
- Requirements, standards, and parameters reflected in design documentation.
- RFI and / or RFP drawn up and sent to vendors.
- Vendor solutions evaluated against design documentation.
- Other elements of a vendor solution evaluation take place (vendor due diligence, final proof of concept testing, contract negotiation, etc.)
The Deploy Phase for unstructured content storage involves implementing a project plan, supported by a well-resourced team, to classify and implement index standards and policies in a hardware/software solution. As indicated, organizations are more likely to find success with piloted or phased implementations. This can mean initially implementing a solution for only one type of content (medical imaging files, for example), for one business area (trading and sales), and so on. That area should be willing to invest time in testing the storage concept and execution and in providing detailed feedback to the project team. Issues with the solution, or with the deployment process itself, can be identified and taken into account in further phases of implementation. A limited scope deployment can take place in a couple of months, while full enterprise deployments can take several months depending upon the degree of standardization, number of content types, amount of storage required, number and nature of policies and rules, and detailed requirements for content access and retrieval.
The cost of not managing unstructured content can include long legal proceedings, non-compliance with legal discovery requirements, non-compliance with regulatory requirements for archival, and the loss of business opportunities or provision of lower quality products or services due to the inability to properly search content. Measuring the return on investment of an unstructured content storage solution is not any easier than measuring ROI for any other process improvement or compliance initiative (think about email, a telephone, or SOX compliance). But total cost of ownership (TCO), quality of service (QoS), and metrics for capacity, latency, and transfer rates, as well as for processes like backups, recoveries, provisioning, inventorying, and archiving, should be measured just as they are for structured data storage. One innovation is the use of the MAPS (Megabyte Objects per Second) measurement of storage efficiency; this is more applicable to multimedia and other unstructured content than KAPS (Kilobyte Objects per Second). Also, measuring the ratio of managed storage (for structured and unstructured content) to total storage is one way of tracking the deployment of the unstructured content storage initiative.
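The managed-to-total storage ratio is straightforward to compute at each deployment checkpoint. This is a minimal sketch: `managed_storage_ratio` is a hypothetical helper, and the phase names and terabyte figures in `snapshots` are invented for illustration.

```python
def managed_storage_ratio(managed_tb, total_tb):
    """Fraction of total storage that is under the managed solution."""
    if total_tb <= 0:
        raise ValueError("total_tb must be positive")
    if not 0 <= managed_tb <= total_tb:
        raise ValueError("managed_tb must be between 0 and total_tb")
    return managed_tb / total_tb


# Hypothetical checkpoints: (phase, managed TB, total TB).
snapshots = [("pilot", 5.0, 100.0), ("phase 2", 40.0, 110.0), ("full", 96.0, 120.0)]
for label, managed, total in snapshots:
    ratio = managed_storage_ratio(managed, total)
    print(f"{label}: {ratio:.0%} of storage under management")
```

Because total storage keeps growing during a phased rollout, the ratio is a more honest progress indicator than the managed volume alone.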
Acceptance test considerations
The Deploy Phase has been successfully implemented when the users are satisfied that their requirements have been met, and when the IT department is happy that their needs for storage management and storage cost reduction are addressed as well.
Key deployment milestones
Milestones in the Deploy Phase include the following:
- Project plan in place and agreed.
- Project team is formed.
- Funding for an unstructured content storage solution is secured.
- A solution is chosen.
- Applicable metrics are determined.
- A limited deployment is arranged and tested.
- Acceptance criteria are evaluated.
- Lessons learned are collected.
- A full scale deployment occurs.
- Final acceptance criteria (users and IT) are evaluated.
Deploying an unstructured content storage solution can take anywhere from a couple of months to several months from analysis to deployment, depending upon the scope of the effort, the organization’s level of storage maturity, and the amount of funding and resources available to direct at the project. Knowledgeable storage management staff need to be freed up from operational or other duties to participate in this effort, as do the users who have been identified as key to the proof of concept and testing activities. Since there are many types of solutions available in this emerging technology area, pinning an average cost on a hardware or software implementation is a tricky exercise; nevertheless, only smaller or limited scope implementations would normally cost less than six figures. Medium to larger scope deployments can cost a few to several hundred thousand dollars. Organizations that have undergone such efforts have typically not added new staff to support the project or live system, but many bring on storage consultants, adding $50,000-150,000 to the cost of an implementation.