This article is written for storage managers and IT professionals interested in best practices for managing structured and unstructured data. It defines both kinds of data, discusses how each is typically managed and how various approaches work, outlines the benefits and implications of managing them, and considers how best to adopt a data management strategy that covers both types of information.
This stub was created in part from an article written by Fred Moore in the book 'New Horizons' from Horison Information Strategies.
Defining structured and unstructured data
Data is usually considered either structured or unstructured. Structured data resides in databases with a defined structure that support applications. Unstructured data refers to information without such a defined structure, including emails, presentations, spreadsheets, documents and virtually every other type of data imaginable.
It is important to point out that the line between structured and unstructured data is blurred. Not all information stored in a database is completely structured (image data types are one example). At the same time, not all documents and other so-called unstructured data are completely without structure. For example, many HTML documents contain tags and meta tags that do constitute structure, XML documents are structured by design, and many word processing files contain structures such as formatting and tables.
How to manage structured and unstructured information
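As a rough illustration of the structure hidden in an HTML document, the sketch below uses Python's standard `html.parser` module to pull the title and meta tags out of a page. The sample page, the `MetaTagExtractor` class and the `extract_structure` helper are hypothetical names made up for this example, not part of any particular product.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collects the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text between <title> and </title> is kept.
        if self._in_title:
            self.title += data


def extract_structure(html_text):
    """Return the structured fields found in an otherwise free-form page."""
    parser = MetaTagExtractor()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title.strip(), "meta": parser.meta}


# Hypothetical sample document used only for illustration.
SAMPLE_PAGE = """<html><head>
<title>Quarterly storage report</title>
<meta name="author" content="J. Doe">
<meta name="keywords" content="storage, archive">
</head><body><p>Free-form text goes here.</p></body></html>"""

record = extract_structure(SAMPLE_PAGE)
print(record)
```

The body text stays free-form, but the title, author and keywords come out as named fields that could be loaded into a database column or a search index.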
Typically, structured data is managed by technology that allows querying and reporting against predetermined data types and understood relationships. To the extent that database fields are clearly defined, meaningful information can be extracted based on desired relationships.
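The following minimal sketch shows what querying against predetermined data types and understood relationships can look like, using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their contents are invented for illustration only.

```python
import sqlite3

# In-memory database; the schema below is an assumption made for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE orders (
           id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customers(id),
           amount REAL NOT NULL)"""
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 120.0), (2, 1, 80.0), (3, 2, 45.5)],
)

# Because fields and relationships are clearly defined, a report such as
# "total order value per customer" is a single declarative query.
totals = conn.execute(
    """SELECT c.name, SUM(o.amount)
       FROM customers AS c
       JOIN orders AS o ON o.customer_id = c.id
       GROUP BY c.name
       ORDER BY c.name"""
).fetchall()
conn.close()

print(totals)
```

The query works only because the column types and the link between `orders.customer_id` and `customers.id` were declared up front, which is the property that unstructured data lacks.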
The management of unstructured data found in emails, presentations, spreadsheets, voice mails, images, etc. is often less clear. Unstructured data can include non-language data types (such as bitmap objects) or textual objects based on a written language. The vast majority of stored information is generally regarded as unstructured.
Structured metadata across digital assets of any type begins to clarify management strategies associated with data types and starts to close the gap between structured and unstructured data management. Content management systems (CMS) and asset management systems are examples of technologies that bring some structure to information typically considered unstructured.
For example, these systems store and manage metadata about an individual file, where the metadata and the file are logically viewed as a single construct. The system manages relationships between these constructs and puts files into a hierarchical structure. One example is this wiki, which tags articles into categories and subcategories organized by topical area of interest to readers.
A single article, taken by itself, is largely unstructured. As a component of a content management system, however, it gains the structure inherent to the wiki environment.
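The idea of treating a file and its metadata as a single construct, placed in a hierarchy of categories, can be sketched as follows. This is a simplified model under stated assumptions, not how any specific CMS is implemented; the `Asset` and `Category` classes, file names and metadata values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A file and its metadata treated as one logical unit."""
    path: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Category:
    """A node in the category hierarchy holding assets and subcategories."""
    name: str
    assets: list = field(default_factory=list)
    subcategories: dict = field(default_factory=dict)

    def subcategory(self, name):
        """Return the named subcategory, creating it if needed."""
        return self.subcategories.setdefault(name, Category(name))

    def walk(self, prefix=""):
        """Yield (category_path, asset) for every asset beneath this node."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        for asset in self.assets:
            yield here, asset
        for child in self.subcategories.values():
            yield from child.walk(here)


# Hypothetical content, loosely modeled on a wiki's category tree.
root = Category("Storage")
root.subcategory("Data management").assets.append(
    Asset("structured_vs_unstructured.txt", {"author": "unknown", "status": "stub"})
)
root.subcategory("Backup").assets.append(
    Asset("tape_rotation.txt", {"status": "draft"})
)

listing = [(path, asset.path) for path, asset in root.walk()]
print(listing)
```

The article text itself is never parsed here. The structure comes entirely from the metadata attached to each file and from its position in the category tree.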
Approaches to managing structured vs. unstructured data
Today's databases and other technologies are bridging the gap between structured and unstructured information. Examples include textual objects that use naming conventions, tags or taxonomies to identify digital assets. Human intervention is often required (for example, to apply keywords or metadata tags), but the effort makes unstructured data more usable in a machine-readable context.
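One way to picture how tags and a taxonomy make unstructured assets machine readable is a small inverted index from keywords to files. The sketch below assumes that a person has already assigned keywords to each asset; the `TAXONOMY` mapping, asset names and `build_tag_index` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical two-level taxonomy: keyword -> broader category.
TAXONOMY = {
    "invoice": "finance",
    "budget": "finance",
    "contract": "legal",
    "nda": "legal",
}


def build_tag_index(assets):
    """Map every keyword, and its parent category, to the assets tagged with it."""
    index = defaultdict(set)
    for name, keywords in assets.items():
        for keyword in keywords:
            normalized = keyword.lower()
            index[normalized].add(name)
            parent = TAXONOMY.get(normalized)
            if parent is not None:
                index[parent].add(name)
    return index


# Keywords here stand in for tags applied by a human reviewer.
ASSETS = {
    "q3_summary.ppt": ["budget"],
    "vendor_agreement.doc": ["Contract", "NDA"],
    "scan_0042.bmp": ["invoice"],
}

index = build_tag_index(ASSETS)
finance_assets = sorted(index["finance"])
legal_assets = sorted(index["legal"])
print(finance_assets, legal_assets)
```

Note that the bitmap scan becomes findable under "finance" even though its content is not text, because the lookup runs against the tags, not against the file itself.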