Web-scale storage lets countless clients concurrently access hundreds of terabytes of data spread across thousands of disks on thousands of computers. Long a goal for many, web-scale storage has now arrived. Google has made petabytes of previously "unstructured data" behave almost like "structured data." It has mastered tags, indexes, and retrieval keys to deliver near-instantaneous searches on a global basis as no one has before, a significant shift in the storage paradigm. This is web-scale storage in its highest form. What is happening in the storage world to make this a reality?
The answer lies with 1) the Google File System (GFS) and 2) clustered storage. Google's secret sauce is GFS, a proprietary, scalable distributed file system that Google developed for large, distributed, data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients on a global basis. GFS stores data in very large files, often multiple gigabytes in size, that are rarely deleted, overwritten, or shrunk; they are usually appended to or read, with few in-place updates. GFS is also designed and optimized to run on Google's computing clusters, whose nodes are inexpensive "commodity" computers. Using commodity hardware means that precautions must be taken against the higher failure rates of individual nodes and against the data loss those failures could cause. No other system has supported as many concurrent global users and as much data over a network while keeping most response times to a few seconds at most.
As Google states about GFS: “We treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to (perhaps concurrently) and then read (usually sequentially), and both extend and relax the standard file system interface to improve the overall system. Our system provides fault tolerance by constant monitoring, replicating crucial data, and fast and automatic recovery.”
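To make the ideas in that quote concrete, here is a minimal Python sketch of append-only records that are replicated across several nodes and automatically re-replicated when a node fails. It is a toy model under stated assumptions, not GFS itself: the class `ToyChunkStore`, its methods, the node names, and the replica count of three are all hypothetical and chosen only for illustration.

```python
"""Toy model of append-only, replicated storage with automatic recovery.

Every name here (ToyChunkStore, append, fail_node, ...) is hypothetical and
is not part of GFS or of any Google API.
"""


class ToyChunkStore:
    def __init__(self, node_names, replicas=3):
        if replicas > len(node_names):
            raise ValueError("need at least as many nodes as replicas")
        self.replicas = replicas
        # Each node is modeled as a dict: chunk id -> stored bytes.
        self.nodes = {name: {} for name in node_names}
        self.next_chunk_id = 0

    def _least_loaded(self, exclude=()):
        """Return node names ordered by how few chunks they hold."""
        candidates = [n for n in self.nodes if n not in exclude]
        return sorted(candidates, key=lambda n: (len(self.nodes[n]), n))

    def append(self, data):
        """Append a new record; existing records are never overwritten."""
        chunk_id = self.next_chunk_id
        self.next_chunk_id += 1
        for name in self._least_loaded()[: self.replicas]:
            self.nodes[name][chunk_id] = data
        return chunk_id

    def read(self, chunk_id):
        """Read from any node that still holds a replica."""
        for store in self.nodes.values():
            if chunk_id in store:
                return store[chunk_id]
        raise KeyError(f"chunk {chunk_id} has no surviving replica")

    def replica_count(self, chunk_id):
        return sum(chunk_id in store for store in self.nodes.values())

    def fail_node(self, name):
        """Simulate losing a node, then restore the replica count."""
        lost = self.nodes.pop(name)
        for chunk_id in lost:
            holders = [n for n, s in self.nodes.items() if chunk_id in s]
            if not holders:
                continue  # every replica is gone; nothing to copy from
            data = self.nodes[holders[0]][chunk_id]
            missing = max(0, self.replicas - len(holders))
            for target in self._least_loaded(exclude=holders)[:missing]:
                self.nodes[target][chunk_id] = data


if __name__ == "__main__":
    store = ToyChunkStore(["node-a", "node-b", "node-c", "node-d"], replicas=3)
    chunk = store.append(b"log record 1")
    store.fail_node("node-a")  # a commodity node dies
    print(store.read(chunk), store.replica_count(chunk))
```

The point of the sketch is the design choice the quote describes: failure handling is part of the normal code path rather than an exceptional one, so losing a node triggers recovery without any client noticing.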
The other concept enabling web-scale storage is clustering, a popular server concept that is being extended to storage subsystems. Clustered storage is a networked storage system that allows users to add compute nodes, all of which access the same pool of data. Arrays work together as an intelligent team, each capable of running on its own and of communicating with other arrays to deliver data in response to user needs. Clustering provides massive throughput because of the increased connectivity that comes from joining many storage servers into a single pool of disks and processors, all working on a similar task and all able to share the same data. Clustered storage falls into two categories: systems that combine block-based data on a storage-area network (SAN), and systems that create a common file name space across NAS filers. NAS clusters are the more common of the two and have been available from NAS vendors for years. With clustered storage, essential storage management functions are distributed across the storage server farm, and storage capacity can be added without disrupting applications running on the cluster.
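The two properties described above, a common name space across several filers and capacity that grows without disrupting existing data, can be sketched in a few lines of Python. This is an illustrative model only: `ClusteredNamespace`, its methods, the filer names, and the "most free space" placement rule are assumptions made for the example, not any vendor's design.

```python
"""Toy model of a common file name space spread over several storage nodes.

All names (ClusteredNamespace, add_node, write, ...) and the placement rule
are hypothetical and do not describe any particular product.
"""


class ClusteredNamespace:
    def __init__(self):
        self.nodes = {}     # node name -> {"capacity": int, "used": int}
        self.location = {}  # file path -> node name holding the file

    def add_node(self, name, capacity):
        """Grow the pool; files already stored are left where they are."""
        if name in self.nodes:
            raise ValueError(f"node {name!r} is already in the cluster")
        self.nodes[name] = {"capacity": capacity, "used": 0}

    def free_space(self, name):
        node = self.nodes[name]
        return node["capacity"] - node["used"]

    def total_free(self):
        return sum(self.free_space(name) for name in self.nodes)

    def write(self, path, size):
        """Place a new file on the node with the most free space."""
        if path in self.location:
            raise FileExistsError(path)
        if not self.nodes:
            raise OSError("cluster has no storage nodes")
        target = max(self.nodes, key=self.free_space)
        if self.free_space(target) < size:
            raise OSError("no node has enough free space")
        self.nodes[target]["used"] += size
        self.location[path] = target
        return target

    def locate(self, path):
        """Clients use one path; the cluster resolves which node holds it."""
        return self.location[path]


if __name__ == "__main__":
    cluster = ClusteredNamespace()
    cluster.add_node("filer1", capacity=100)
    cluster.write("/projects/a.dat", size=80)
    cluster.add_node("filer2", capacity=100)  # capacity added while in use
    cluster.write("/projects/b.dat", size=50)
    print(cluster.locate("/projects/a.dat"), cluster.locate("/projects/b.dat"))
```

Clients address files by path alone, so adding `filer2` changes where new data lands without changing how existing files are reached.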
--Fmoore 12:13, 14 December 2007 (CST)
Action Item: Organizations that require clustered storage today should be prepared to build it themselves or to experiment with upstarts. Storage clustering, a predecessor to grid storage, has generated increased industry discussion lately as more businesses need web-scale storage, but few market leaders have fully embraced the concept. Nonetheless, clustered storage brings IT closer to grid storage, something that is still several years away.