Users and organizations are striving to write applications that can create value from a vast panoply of data objects that reside within an organization, are shared within the ecosystems to which the organization belongs, and are shared across the Web. As Dave Vellante argues in his piece "Storage to get a slice of Google pie", the key question is “… what are the requirements and characteristics of storage that will support this opportunity and allow companies to compete by putting in place cost-effective, massive Web-scale (versus enterprise-scale) infrastructure to support these business models?”
The Web 3.0 applications that exploit Web-scale computing will be as varied as the organizations and people that create them, and there are opportunities for many different architectures and approaches to Web 3.0 storage. Google has implemented the first and largest instantiation of a clustered storage network, and its implementation is used here as a reference model. Some of the common characteristics that Web 3.0 storage is likely to have are projected as follows:
- Self-healing vs. backup and restore:
- The business recovery mechanisms in place in the majority of data centers are based on a master copy of transactional data residing on unreliable storage hardware located in a single data center. As disks get larger, the time to rebuild a failed disk gets longer: disk capacity is doubling every year, but transfer rates are improving very slowly (roughly 4% per year), so recovery becomes a growing problem (a rough calculation of this trend appears after this list). For unstructured data this model makes neither technical nor economic sense.
- The Google File System (GFS) spreads data across many systems in different locations and appends to files rather than updating them in place. Data is not backed up for recovery purposes; GFS assumes that any system or software component will eventually fail, and it recovers automatically from any failure.
- Projection: Web 3.0 unstructured data will not rely on traditional backup and recovery mechanisms. The file system will be integrated with the storage and will handle all recovery operations automatically.
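To make the rebuild-time problem concrete, here is a back-of-the-envelope sketch in Python. The starting capacity and transfer rate are illustrative assumptions, not figures from the text; only the growth rates (capacity doubling yearly, transfer rates improving about 4% per year) come from the argument above.

```python
# Rough illustration of why disk rebuild times keep growing.
# Starting values below are assumptions chosen for illustration.
GB = 1024 ** 3
MB = 1024 ** 2

capacity = 500 * GB    # assumed disk size in year 0
rate = 50 * MB         # assumed sustained transfer rate: 50 MB/s

for year in range(6):
    rebuild_hours = capacity / rate / 3600
    print(f"year {year}: {capacity / GB:,.0f} GB disk rebuilds "
          f"in {rebuild_hours:.1f} hours")
    capacity *= 2      # capacity doubles every year
    rate *= 1.04       # transfer rate improves ~4% per year
```

Under these assumptions, rebuild time grows by a factor of roughly 2/1.04 ≈ 1.9 each year, from about 3 hours in year 0 to about 75 hours in year 5.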
- Commodity components vs. specialized arrays:
- Storage in high-performance arrays costs about $20/GB, while the raw cost of the underlying disk drives is less than $1/GB.
- Google builds its own servers with storage included, producing the world’s largest storage infrastructure at an order-of-magnitude lower price than traditional arrays. Functionality is built into GFS and can be ported quickly to new generations of hardware as soon as they become available (a sketch of this software-level approach follows this list).
- Projection: Web 3.0 clustered storage networks will be built on commodity hardware by multiple vendors, and significant storage management and storage functionality will reside in the file system.
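The following minimal Python sketch illustrates the idea that replication and recovery logic can live in the file-system layer running on commodity nodes rather than in array hardware. The class name, methods, and replica count are hypothetical, invented for illustration; they are not the actual GFS interfaces.

```python
import random

REPLICAS = 3  # GFS-style triple replication; the exact count is an assumption

class ChunkStore:
    """Toy file-system layer that keeps chunks replicated across
    unreliable commodity nodes (illustrative, not a real GFS API)."""

    def __init__(self, nodes):
        self.nodes = set(nodes)   # currently healthy nodes
        self.placement = {}       # chunk_id -> set of nodes holding a copy

    def write(self, chunk_id):
        # Place copies of a new chunk on REPLICAS distinct nodes.
        self.placement[chunk_id] = set(random.sample(sorted(self.nodes), REPLICAS))

    def node_failed(self, node):
        # Self-healing: forget the dead node, then re-replicate every chunk
        # that lost a copy, using the surviving replicas as sources.
        self.nodes.discard(node)
        for holders in self.placement.values():
            holders.discard(node)
            while len(holders) < REPLICAS and self.nodes - holders:
                holders.add(random.choice(sorted(self.nodes - holders)))

store = ChunkStore(["n1", "n2", "n3", "n4", "n5"])
store.write("chunk-0001")
store.node_failed("n2")   # no restore from tape: the layer heals itself
print(store.placement)    # chunk-0001 still has three live copies
```

Because the recovery logic sits entirely in software, moving to a new hardware generation means redeploying the same code on new commodity boxes, which is the economic point of the approach.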
- Clustered storage network vs. SAN:
- SANs have been very successful at improving data accessibility and storage utilization. However, managing remote storage with traditional replication products such as EMC SRDF and Hitachi UR is very expensive and pushes storage costs into the $60-$100 per GB range.
- Google has built its storage infrastructure on the assumption that data is distributed, and it automatically protects, moves, or caches data locally to optimize end-user response time and experience (a sketch of latency-aware replica selection follows this list).
- Projection: Web 3.0 clustered storage networks will be built assuming that data can reside anywhere on an IP-based network.
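A minimal sketch of what "data can be anywhere on an IP network" implies for reads: given several replica locations, the client (or the file system acting on its behalf) picks whichever copy answers fastest. The data-center names and latency figures below are assumptions for illustration.

```python
# Hypothetical measured round-trip times from a client to three sites (ms).
latency_ms = {"dc-us-east": 12, "dc-eu-west": 95, "dc-asia": 180}

# Which sites hold a copy of each chunk (illustrative placement).
replicas = {"chunk-0001": ["dc-eu-west", "dc-asia", "dc-us-east"]}

def nearest_replica(chunk_id):
    # Read from whichever site holding the chunk is closest on the network.
    return min(replicas[chunk_id], key=latency_ms.__getitem__)

print(nearest_replica("chunk-0001"))   # -> dc-us-east
```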
- Open vs. proprietary standards:
- Vendors will be tempted to control the APIs for application access to clustered file systems, and there will be strong pressure within EMC to do the same with Hulk and Maui.
- Projection: Any such attempt will fail because of Metcalfe’s law (the value of a network is proportional to the square of the number of users of the system), and open standards will emerge (a back-of-the-envelope illustration follows).
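A quick calculation shows why fragmentation loses under Metcalfe’s law. The user counts below are illustrative assumptions; the only input from the text is that network value grows as the square of the number of users.

```python
def network_value(users):
    return users ** 2   # Metcalfe's law: value ~ n^2

n = 1_000_000           # assumed total user population

one_open_standard = network_value(n)
# The same users split across four incompatible proprietary APIs:
four_walled_gardens = 4 * network_value(n // 4)

print(one_open_standard / four_walled_gardens)   # -> 4.0
```

Splitting the same population across m incompatible standards divides the aggregate network value by m, so the economics push everyone toward a single open standard.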
Action Item: Profound technological change is coming to the provisioning and exploitation of storage. IT should assign its best and brightest to design applications that will exploit the network of data and push standards bodies to create access standards.