As more valuable personal and corporate information is stored in both public and private clouds, organizations will increasingly rely on an expanded view of metadata to both create new value streams and mitigate information risk. By their very nature, cloud architectures allow greater degrees of sharing and collaboration, and they present new opportunities and risks for information professionals.
Incremental value from cloud metadata will come from leveraging "associative context" generated by software that observes users and their evolving relationships with content elements, both internal and external to an organization. This software will create metadata that allows individuals and organizations to extract even more value from information.
These were the ideas put forth by Dr. Tom Coughlin and Mike Alvarado, who presented them to the Wikibon community from a new paper: Angels in our Midst: Associative Metadata in Cloud Storage.
What is Metadata and Why is it Important?
Metadata is data about data: high-level information that includes when something was done, where it was done, the file type and format of the data, the original source, and so on. The notion of metadata can be expanded to include information about how content is being used, who is using it, and what relevant and valuable associations can be observed when multiple pieces of content are used together.
According to the authors, users should think about the different types and levels of metadata, ranging from low-level metadata that provides, for example, information about the physical location of blocks, all the way to higher-level metadata that goes beyond the descriptive to include judgmental information. In other words, what does this content mean, and is it relevant to a particular objective or initiative? For example, will I like how it tastes? Will it be cosmetically appealing to me?
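To make the distinction concrete, here is a minimal sketch of what metadata at each of these levels might look like. The field names are illustrative assumptions, not a standard schema.

```python
# Low-level metadata: physical placement and access details.
low_level_metadata = {
    "block_id": 48213,                       # where the block physically lives
    "device": "array-07",
    "last_access": "2010-11-02T14:31:00Z",
}

# Descriptive metadata: classic "data about data" fields.
descriptive_metadata = {
    "file_type": "jpeg",
    "created": "2010-10-28",
    "source": "field-camera-3",
}

# Higher-level, judgmental metadata attaches meaning and relevance,
# e.g. whether content matters to a particular objective.
judgmental_metadata = {
    "subject": "produce quality photo",
    "relevant_to": ["harvest-2010-review"],
    "judgment": {"cosmetically_appealing": True, "confidence": 0.7},
}
```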
By leveraging metadata more intelligently, organizations can begin to extract new business value from information and potentially introduce new business models for value creation. Underpinning this opportunity are the relationships between content elements, which the authors refer to as associative metadata.
What's Needed to Exploit Associative Metadata?
According to Alvarado and Coughlin, creating associative metadata requires an agent that can act as an objective observer. Since individual humans create many of these relationships, software intelligent enough to watch what users are doing and generate new metadata around those interactions and relationships is needed. The authors refer to this agent as a "Guardian Angel."
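As a rough illustration of the concept, the following Python sketch shows how such an observer might turn co-usage of content into associative metadata. The class and method names are hypothetical; the authors do not specify an implementation.

```python
from collections import defaultdict
from itertools import combinations

class GuardianAngel:
    """Hypothetical observer that derives associations from user activity."""

    def __init__(self):
        self.session_content = defaultdict(set)   # user -> content touched
        self.associations = defaultdict(int)      # content pair -> strength

    def observe(self, user, content_id):
        """Record a user touching a piece of content."""
        self.session_content[user].add(content_id)

    def derive_associative_metadata(self):
        """Turn co-usage within each user's activity into associative metadata."""
        for touched in self.session_content.values():
            for a, b in combinations(sorted(touched), 2):
                self.associations[(a, b)] += 1
        return dict(self.associations)

angel = GuardianAngel()
angel.observe("alice", "report.docx")
angel.observe("alice", "q3-sales.xlsx")
angel.observe("bob", "report.docx")
angel.observe("bob", "q3-sales.xlsx")
print(angel.derive_associative_metadata())
# {('q3-sales.xlsx', 'report.docx'): 2} -- an observed relationship
```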
Further, to extend the notion of metadata to an even higher, richer, and more complex level, the paper puts forth the notion of "The Invisible College," a tool for managing the relationships among Guardian Angels and a useful framework for creating a more intricate system of data systems.
What Different Types of Metadata Exist?
On the call, Todd, an IT practitioner, put forth a simple metadata model that included three layers:
- Basic metadata - low-level data, e.g., block-level information about where data is stored and how often it is accessed.
- File-level metadata - more complex data from file systems.
- Content-level metadata - metadata that might be found in content management systems, including file type and more meaningful data such as: "Is this email from the CEO?" or "Is the mammogram positive?"
Each metadata type is actionable. For example, basic metadata can be used to automate tiering, file-system data can be used to speed performance, and high-level metadata can be used to take business actions. The key challenge is capturing, processing, analyzing, and managing all this metadata in an expedient manner.
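For instance, a tiering decision driven by basic metadata might look something like the following sketch, where the tier names and thresholds are invented for illustration.

```python
def choose_tier(access_count_30d: int, last_access_days: int) -> str:
    """Map access-frequency metadata to a storage tier (illustrative rules)."""
    if access_count_30d > 100 and last_access_days <= 1:
        return "ssd"        # hot data: keep on fast media
    if last_access_days <= 30:
        return "sas"        # warm data
    return "archive"        # cold data: migrate to cheap storage

print(choose_tier(access_count_30d=250, last_access_days=0))   # ssd
print(choose_tier(access_count_30d=3, last_access_days=120))   # archive
```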
How Will these Metadata be Managed?
The authors have put forth in their paper a taxonomy for metadata that is very granular but is broken down, at the highest level, into two categories:
- Meaning Metadata
- Basic Metadata
The paper defines an OSI-like model whose layers run from a physical layer through operational layers into semantic and contextual layers. The definition of these layers, and of their interaction and management, will be vital to harnessing the power of this metadata. A key limiting factor is the raw power of systems and their ability to keep pace not only with user interactions but also with machine-to-machine (M2M) communications.
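One way to picture the stack is as a simple ordered enumeration. Only the layer names below come from the description above; their exact boundaries and contents are the paper's to define.

```python
from enum import IntEnum

class MetadataLayer(IntEnum):
    PHYSICAL = 1      # block locations, devices (basic metadata)
    OPERATIONAL = 2   # access patterns, file-system state
    SEMANTIC = 3      # what the content is and means
    CONTEXTUAL = 4    # relevance to users, objectives, and other content

# Higher layers build on (and must keep pace with) the layers beneath them.
assert MetadataLayer.CONTEXTUAL > MetadataLayer.PHYSICAL
```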
The community discussed the possibility of applying a Hadoop-like framework, Hadoop itself, or other semantic web technologies to evolving metadata architectures. Rather than shoving all the data into one big repository, the idea is to distribute the metadata and allow parallel processing concepts to operate in tandem. By allowing the metadata to remain distributed, massive volumes of data can be managed and analyzed in real or near-real time, providing a step-function improvement in metadata exploitation.
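A toy sketch of this distribute-and-process idea follows: a map step runs against each simulated metadata shard in parallel, and a reduce step merges the partial results. The shard contents and the query are invented for illustration.

```python
from multiprocessing import Pool
from collections import Counter

SHARDS = [  # each list simulates metadata held at one cloud location
    [("report.docx", "alice"), ("q3-sales.xlsx", "alice")],
    [("report.docx", "bob"), ("budget.xlsx", "bob")],
]

def map_shard(shard):
    """Count content usage locally, without moving the raw metadata."""
    return Counter(content for content, _user in shard)

if __name__ == "__main__":
    with Pool(processes=len(SHARDS)) as pool:
        partials = pool.map(map_shard, SHARDS)  # process shards in parallel
    totals = sum(partials, Counter())           # reduce: merge partial counts
    print(totals.most_common())                 # [('report.docx', 2), ...]
```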
Does such a framework, specifically designed for cloud metadata management, exist today? To the community's knowledge, not per se, but numerous open source initiatives such as Hadoop have the potential to be applied to this problem, creating new opportunities for metadata management in cloud environments.
What about Privacy and Security in the Cloud?
Cloud information storage is accomplished by providing access to stored assets over TCP/IP networks, whether public or private. Cloud computing increases the need to protect content in new ways precisely because the physical perimeter of the organization and its data is fluid. The notion of building a moat to protect the queen in her castle is outmoded in the cloud, because sometimes the queen wants to leave the castle. Compounding the complexity of privacy and data protection is the fact that associations and interactions will dramatically increase between users, between users and machines, and between machines themselves.
Not surprisingly, according to Alvarado and Coughlin, one answer to this problem is metadata. New types of metadata, they argue, will evolve to ensure the integrity, security, and privacy of content that is shared and created by individuals, groups, and machines. For example, metadata could evolve to monitor the physical location of files and ensure that the physical storage of that data complies with local laws that may prohibit storing data outside a particular country.
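A residency check of that kind might be as simple as the following sketch, in which the policy table and metadata fields are hypothetical.

```python
RESIDENCY_POLICY = {  # data class -> countries where storage is permitted
    "customer-records": {"DE"},            # must stay in Germany
    "marketing-assets": {"DE", "US", "SG"},
}

def placement_allowed(file_metadata: dict) -> bool:
    """Check a file's physical-location metadata against local-law policy."""
    allowed = RESIDENCY_POLICY.get(file_metadata["data_class"], set())
    return file_metadata["stored_in_country"] in allowed

print(placement_allowed({"data_class": "customer-records",
                         "stored_in_country": "US"}))   # False -> violation
```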
Location services are currently one of the hottest areas in business, but they lack a mechanism to enable the levels of privacy users desire. Services in the Internet world are often introduced under conditions where they outstrip the ability of the infrastructure to provide a mature framework for issues like corporate policy or effective privacy, with the necessary protections being added after the fact. Mechanisms described in the paper can make infrastructure more predisposed to keep new service ideas in sync with those protections from the start. If developers had easy access to such a toolkit before deploying, we would be ahead of the game.
The authors state the following: "Whatever data system solutions arise, they will have identifiable characteristics such as automated or semi-automated information classification and inventory algorithms with significance and retention bits, automated information access path management, information tracking and simulated testing access (repeated during the effective life time of information for quality assurance), and automated information metadata reporting."
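As one reading of the "significance and retention bits" idea in that list, the following sketch shows an automated classifier stamping items with flags that a downstream policy engine could act on. The rules are invented examples, not the authors' algorithms.

```python
from dataclasses import dataclass

@dataclass
class ClassificationBits:
    significant: bool     # e.g., from the CEO, or a positive diagnosis
    retain_years: int     # retention period implied by the classification

def classify(item: dict) -> ClassificationBits:
    """Automated classification stub that sets significance and retention bits."""
    if item.get("sender_role") == "CEO":
        return ClassificationBits(significant=True, retain_years=7)
    if item.get("kind") == "mammogram" and item.get("result") == "positive":
        return ClassificationBits(significant=True, retain_years=30)
    return ClassificationBits(significant=False, retain_years=1)

print(classify({"sender_role": "CEO"}))
```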
A key concern in the Wikibon community is the notion of balancing information value with information risk. Specifically, business value can constrict as organizations increasingly automate the policing of data and information, and striking a balance between risk reduction and value creation is an ongoing challenge for CIOs. The bottom line is that the degree of emphasis on risk versus value will depend on a number of factors, including industry, regulatory requirements, legal issues, past corporate history, culture, company status (i.e., private versus publicly traded), and other issues.
When will Architectures and Products Emerge?
Clearly the authors' ideas are futuristic in nature; however, the value of this exercise is that Alvarado and Coughlin are defining an end point and helping users and vendors visualize the possibilities of cloud computing in the context of creating new business models. The emphasis on metadata underscores the importance of defining, understanding, and managing metadata to create new business opportunities and manage information risk. The role of metadata in this regard is undeniable, and while solutions on the market are limited today, they are beginning to come to fruition in pockets.
Key developments are occurring within standards communities to address this opportunity, and the authors believe that these efforts are beginning to coalesce around the Angel and Invisible College notions. Specifically, they cite the evolution of Intel's work on Fodor. Likewise, mashups that combine Google mapping capabilities with GPS triangulation make extensive use of metadata to create new value. Frameworks like Hadoop could be instrumental in providing fast analytics for big data, and other semantic web technologies are emerging to address these issues. From a storage perspective, few suppliers are actively talking about this opportunity in their marketing, but several have advanced development projects underway to better understand how to exploit metadata for classification, policy automation, collaboration, and the like.
Action Item: Initial cloud computing deployments have been accelerated by the economic crisis, and many have focused on reducing costs. At the same time, numerous organizations are enabling new business models using cloud platforms. These initiatives are creating truckloads of data and metadata, and users and vendors must identify opportunities to both harness and unleash metadata to mitigate information risks while creating new value pathways. The degree to which this is possible will be a function of an organization's risk tolerance, its culture, regulatory compliance edicts, and a number of other factors. Whatever the path, metadata exploitation will be fundamental to managing data in the 21st century.