Storage Peer Incite: Notes from Wikibon’s December 11, 2007 Research Meeting
This week Wikibon presents Clustered storage mashup. While not a new idea (it goes back to the VAXclusters of the early 1980s), clustered storage has gained a great deal of attention with the publicity surrounding the huge clustered systems built by Amazon.com and Google. These very public and highly successful implementations provide a constant demonstration of how relatively inexpensive (as low as $1/GB of storage) "mashups" of independent, lower-performance sets of disks, sitting behind small, cheap servers, can provide "good enough" performance for very large sets of unstructured, end-user-created data such as Web search results and shopper Web click-throughs. And while clustered storage does not deliver subsecond response, that level of performance is not needed for these applications.
What makes this more than an academic interest is that increasing numbers of organizations worldwide, in virtually all verticals, are discovering the need to maintain active archives of huge amounts of unstructured, end-user generated "data." At the moment the focus is on email and driven by recent regulatory actions across the First World. But the expectation is that this will expand to include IMs and even recordings of all business telephone calls. In the United States, for instance, financial companies are now required to capture and archive all communications with clients of any kind. And companies are finding increasing value in being able to analyze this mountain of unstructured data. But high performance systems are simply too expensive to handle all this data, and they are also overkill -- subsecond seek times do not add value to these applications. Also, these huge collections of data require a new approach to backup and restore. There simply is too much data to ever restore the database from tape. Clustered storage on the Amazon.com/Google model seems to offer solutions to all of these problems, and therefore it looks like the coming technology, not to replace high performance transactional systems but to augment them and handle a different kind of data economically but with adequate performance and security from disk failures.
Bert Latamore
Clustered storage technologies have been available for several decades. However, the publicity surrounding the inexpensive, highly scalable clustered storage architectures implemented by Web giants Google and Amazon to support their specialized storage requirements has created new excitement about clustered storage technologies. This interest will only intensify in mid-2008, when EMC is expected to enter this emerging market with a combination of products code named Hulk (hardware) and Maui (software).
The fundamental feature of clustered storage is its emphasis on capacity scalability for unstructured data and on cost optimization through the use of lower-cost components than are normal in large arrays. The trade-off is lower seek and I/O-per-second performance.
The cost of storage on large arrays is as high as $20/GB, while on some clustered systems it is as cheap as $1/GB. This price differential, combined with high scalability, makes clustered storage an attractive solution, despite its lower read performance, for storing the explosion of unstructured data created by users in organizations of all sizes.
As regulatory and business requirements for keeping large quantities of data evolve over the next few years, the need to support “Web-scale” storage spaces will only grow in many organizations. The key question is not whether clustered storage systems represent a viable technology set but rather how users will gain access to the differentiated price points associated with those systems. Some users no doubt will encounter the need for Web-scale computing faster than others, driven by particular business characteristics that feature greater growth in unstructured data. These users are likely to deploy clustered systems alongside more traditional technologies, including very large arrays. Other users may not encounter the same pressure as quickly and may turn to alternative means to source the pricing benefits of clustered storage, perhaps through exploitation of storage-related services from storage hosting companies.
A key challenge users will face as they plan their storage investments will be to identify the right degree of scale in their various storage and application needs to ensure they are getting the right storage technology for the right price. This is where EMC's expected announcement becomes interesting. EMC is likely to present a coherent vision of the marriage of controller-oriented arrays for high throughput and clustered storage arrays for applications with high scalability but lower data access speed requirements. It will take a few years for users to develop the right rules of thumb for the appropriate mix of storage technologies, but it is increasingly evident that clustered storage will emerge as an important option.
Action item: Users should gain familiarity with clustered storage technologies, and specifically the different price points, scalability, and performance levels they may offer, and start to use classification and other strategies to discretely organize their applications, data, information, and storage needs to take advantage of clustered storage technologies as they gain favor through multiple sourcing options.
Users and organizations are striving to write applications that can create value from a vast panoply of data objects that reside within an organization, are shared within ecosystems that the organization belongs to, or are shared across the Web. As Dave Vellante argues in his piece "Storage to get a slice of Google pie", the key question is “… what are the requirements and characteristics of storage that will support this opportunity and allow companies to compete by putting in place cost effective, massive Web-scale (versus enterprise scale) infrastructure to support these business models?”
The Web 3.0 applications that exploit Web-scale computing will be as varied and different as the organizations and people that create them. There are opportunities for many different architectures and approaches for Web 3.0 storage. Google has implemented the first and largest instantiation of a clustered storage network and is used here as a reference model. Some of the common characteristics that Web 3.0 storage is likely to have are projected as follows:
- Self-healing versus backup and restore:
  - The business recovery mechanisms in place in the majority of data centers are based on a master copy of transactional data residing on unreliable storage hardware located in one data center. As disks get larger, the time to recover a disk gets longer (disk capacity is doubling every year, but transfer rates are improving only about 4% per year), so recovery becomes a growing problem. For unstructured data this model makes neither technical nor economic sense.
  - The Google File System (GFS) spreads the data across different systems in different locations and appends rather than updates in place. Data is not backed up for recovery purposes. GFS assumes that there will be failures of all system and software components and recovers automatically from any failure.
  - Projection: Web 3.0 unstructured data will not rely on traditional backup and recovery mechanisms. A file system will be integrated with storage and handle all recovery operations automatically.
- Commodity components vs. specialized arrays:
  - Storage in high performance arrays costs $20/GB. The raw cost of storage is less than $1 per GB.
  - Google builds its own systems with storage included to produce the world’s largest storage infrastructure at an order-of-magnitude lower price than traditional arrays. Functionality is built into the GFS and can be ported quickly to new generations of hardware as soon as they become available.
  - Projection: Web 3.0 clustered storage networks will be built on commodity hardware by multiple vendors, and significant storage management and storage functionality will reside in the file system.
- Clustered storage network vs. SAN:
  - SANs have been very successful at improving accessibility to data and utilization of storage. Managing remote storage using traditional methods such as EMC SRDF and Hitachi UR is very expensive and increases storage costs to the $60-$100 per GB range.
  - Google has built the storage infrastructure on the assumption that it is distributed, and automatically protects, moves or caches data locally to optimize the end-user response time and experience.
  - Projection: Web 3.0 clustered storage networks will be built assuming that data can be anywhere on an IP-based network.
- Open vs. proprietary standards:
  - Vendors will be tempted to control the APIs for application access to clustered file systems, and there will be strong pressure within EMC to do the same with Hulk and Maui.
  - Projection: Any such attempt will fail because of Metcalfe’s law (the value of a network is proportional to the square of the number of users of the system), and open standards will emerge.
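The recovery argument in the self-healing item above comes down to simple arithmetic: the time to stream a whole disk is its capacity divided by its transfer rate, so if capacity doubles yearly while transfer rates improve 4% per year, that time grows by a factor of 2/1.04 (roughly 1.9) each year. The sketch below works this through. The growth rates are the ones quoted above; the starting capacity and transfer rate in the usage note are hypothetical values chosen only for illustration, and real rebuild times depend on many factors this ignores.

```python
def full_disk_read_hours(capacity_gb, transfer_mb_s):
    """Hours needed to stream an entire disk once at a sustained transfer rate."""
    seconds = capacity_gb * 1000 / transfer_mb_s  # decimal units: 1 GB = 1000 MB
    return seconds / 3600


def project_rebuild_hours(capacity_gb, transfer_mb_s, years,
                          capacity_growth=2.0, transfer_growth=1.04):
    """Return (year, hours) pairs, compounding both growth rates each year.

    capacity_growth=2.0 and transfer_growth=1.04 are the rates quoted in the
    text; the starting capacity and transfer rate are supplied by the caller.
    """
    rows = []
    for year in range(years + 1):
        rows.append((year, full_disk_read_hours(capacity_gb, transfer_mb_s)))
        capacity_gb *= capacity_growth
        transfer_mb_s *= transfer_growth
    return rows
```

For example, `project_rebuild_hours(500, 70, 5)` projects five years forward from a hypothetical 500 GB disk that sustains 70 MB/s. Whatever the starting point, each year's figure is about 1.9 times the previous one.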
Action item: Profound technological change is coming to the provisioning and exploitation of storage. IT should assign its best and brightest to design applications that will exploit the network of data and push standards bodies to create access standards.
Clustered storage is touted as being able to support a single storage space spanning multiple, widely dispersed data centers (i.e., at distances beyond 100 km). If this benefit matures as suggested by leading proponents of clustered storage (e.g., low-cost, simple interconnects, open protocols, standard management conventions), then not only will the storage product hierarchy be rewritten, but the mechanisms for sourcing storage capacity will be revolutionized.
In addition to buying storage capacity, options to rent general-purpose storage capacity for a broad class of applications are likely to evolve. Today, these options are limited to applications like mobile PC backup/restore (e.g., Connected, Mozy). Soon, organizations requiring capacity may be able to quickly and seamlessly acquire gigabytes from commercial suppliers of clustered storage "hosted services." However, if these services mature, storage administrators will have to learn a host of new sourcing tricks related to capacity renting (e.g., when to rent, how much to rent, what data to put on rented capacity, service levels associated with renting). Networking professionals, who have already mastered many of the sourcing complexities associated with buy/rent decisions, will be a great source of sourcing insight for storage professionals as the market for clustered storage diversifies.
Action item: To achieve economic scale for clustered storage implementations, both vendors and users of clustered storage infrastructures will explore a variety of go-to-market options, including capacity renting. Standards for technology, packaging, and services will be keys to successful, broad-scale clustered storage markets.
Everyone wants a piece of Google's success. As unstructured and semi-structured data grow to comprise more than 80% of corporate information, all types of organizations in virtually every industry are trying to figure out how to exploit Web 2.0 and apply so-called 'Cloud Computing' models to monetize enormous volumes of Web data and user interactions. Indeed, why let Google have all the fun (and profits)?
In the book Wikinomics, authors Don Tapscott and Anthony Williams make the case that new communications and collaborative Web technologies are democratizing value creation. Specifically, the authors posit that organizations will increasingly open up their IP to catalyze innovation by thousands or even millions of collaborators, unleashing unprecedented value. This realization, combined with the confluence of massive data explosions, the emergence of Google-like business models and a flood of inexpensive software technology, is presenting huge opportunities for organizations around the world.
The question is, what are the requirements and characteristics of storage that will support this opportunity and allow companies to compete by putting in place cost effective, massive Web-scale (versus enterprise scale) infrastructure to support these business models? Here are some of the more obvious requirements and differences from today's enterprise storage:
- Petabyte versus terabyte scale,
- Millions versus thousands of users,
- Thousands of network nodes versus hundreds of servers,
- Highly distributed versus command and control,
- Self-healing versus backup-and-restore,
- Auto-classified versus admin-classified data,
- Intelligence in the data versus intelligence in the application,
- Semantic building blocks versus discrete files and records,
- Cheaper than dirt and simple to operate versus really expensive and complex to manage.
The predominant use case today for this type of storage infrastructure is the Google File System, where globally distributed data are stored, indexed, searched and rapidly retrieved by one billion Internet users, supported predominantly by advertising. Other examples include Facebook and Wikipedia, which provide a platform for users to generate their own content, creating massive repositories of information.
However, these are only three examples. More mature models like eBay are evolving, and new models are emerging within the telco and managed service provider spaces, as well as other online businesses, where users are enticed with free services or content and then offered incremental value for pay. By introducing transactionality into the equation, these businesses are blurring the lines between structured and unstructured data, creating incremental demands on resiliency, state and performance.
In addition, organizations are researching so-called Web 3.0 models where relatively small applications interact with each other and access data that resides on the Web or in a 'cloud.' These applications tend to be highly customizable, very fast and available on a variety of mobile devices. They also contain richer media, and observers can expect bandwidth requirements to keep escalating, to orders of magnitude more than those of today's so-called Web 2.0 applications. Importantly, Web 3.0 applications are expected to leverage quasi-artificial intelligence in a fashion that involves human interaction and collaborative filtering (e.g. Flickr, Digg and collaborative search engines) to enable mashups of seemingly disparate content placed in a user context. To get a sense of what these storage capabilities will look like, check out David Floyer's Peer Incite "Web 3.0 clustered storage networks."
Clearly not all organizations will build out the storage infrastructure to support these applications themselves, and many will source these capabilities from service providers. What is clear is that the vast majority of organizations, large and small, will participate in some way in this emerging marketplace.
Action item: Organizations must begin to document the market requirements for evolving and emerging Web businesses and understand the parameters, constraints and opportunities presented by them. Storage infrastructure product requirements should evolve from this thinking supported by a new class of Web scale storage products from both established players and new entrants.
Context: EMC's support of clustered storage for archiving and backup will legitimize the technology and bring instant attention to the market. Vendors with competitive products have a window of opportunity to position themselves as a superior alternative. However, make no mistake, EMC plans to own this space and will commit significant financial, R&D and marketing resources to the effort.
EMC's market entry will be hobbled by several problems that competitors can try to exploit.
- Immature software: limitations, bugs, and the evaluation cycle that implies
- Maintaining bright-line positioning between Hulk/Maui and Symmetrix
- 60% gross margin requirement
EMC will be NDA'ing strategic customers starting in mid-January to build major sales to reference at announcement. Smart customers will be calling other vendors, including the smaller, innovative ones, for perspective. Luck favors the prepared: a vendor meeting customers fresh from a trip to Hopkinton will have a tough act to follow, despite the cold New England weather.
Strategy: IBM, Hitachi, HP, NetApp
IBM Global Services should be open to reselling/integrating suitable substitutes. There are efforts within IBM's storage group to create a scalable, commodity storage infrastructure, but the chasm between IBM's brilliant technologists and IBM marketing makes success problematic.
Hitachi does not seem to be doing anything in this area. It will be looking at an acquisition, and it will probably bide its time.
HP's PolyServe acquisition may convince the company that it has the cluster thing under control, but PolyServe is not competitive with EMC's initiative. HP has a deep well of technology expertise from the DEC cluster products. Expect a cluster acquisition in 2008.
NetApp is vulnerable. ONTAP GX has missed the cluster market, and NetApp's controller-based architecture has all the cost disadvantages of traditional arrays without the flexibility of clustering. Putting ONTAP 7G on commodity hardware bricks with software "mortar" - as Google does with GFS - would preserve its significant advantages with WAFL at a lower $/GB.
New competitors: Now is the time to get serious about what your product really does and what its appeal is to customers. Focus is critical to building a defensible position that can be used to win F500 business in areas where EMC is less competitive. There is also an opportunity to shift the terms of the customer debate. This market is still fluid and customers don't have a clear mental map of the terrain. Smart, focused marketing can take advantage of that.
Small/new vendors: if you want to be acquired, now is the time to be shopping yourself to the big guys. If you want to build a big business, get your marketing focused on verticals and business justification.
Big vendors: start shopping now. EMC wants your scalp, so you'll want to be well-armed.
All: there is a lot more to know about Hulk/Maui. A focused competitive analysis effort will pay dividends.
Web scale storage enables hundreds of terabytes of data across thousands of disks on thousands of computers to be concurrently accessed by countless clients. Long a goal for many, web scale storage has now arrived. Google has successfully enabled petabytes of previously "unstructured data" to behave almost like "structured data." Google has mastered tags, indexes, and retrieval keys for near-instantaneous searches on a global basis like no one before it - a significant transformation and shift in the storage paradigm. This is web-scale storage in its highest form. What is happening in the storage world to make this a reality?
The answer lies with 1) the Google File System (GFS) and 2) clustered storage. The secret sauce for Google is GFS, a proprietary scalable distributed file system developed by Google for massive distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients on a global basis. GFS data is stored in very large files, often multiple gigabytes in size, that are rarely deleted, overwritten, or shrunk; they are usually appended to or read, with few updates in place. GFS is also designed and optimized to run on Google's computing clusters, where the nodes are inexpensive, "commodity" computers. Because the hardware is commodity, precautions must be taken against the higher failure rates of individual nodes and the resulting potential for data loss. No other system has ever supported as many concurrent global users and as much data over a network while keeping most response times to a few seconds at most.
As Google states about GFS: “We treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to (perhaps concurrently) and then read (usually sequentially), and both extend and relax the standard file system interface to improve the overall system. Our system provides fault tolerance by constant monitoring, replicating crucial data, and fast and automatic recovery.”
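The behavior described above (append rather than update in place, replicate data across cheap nodes, and recover automatically when a node fails) can be illustrated with a toy in-memory model. This is a minimal sketch for intuition only: the class `ToyCluster`, its method names, and the default of three replicas are all assumptions made up for this example, and it does not represent Google's actual GFS design or API.

```python
import random


class ToyCluster:
    """In-memory stand-in for a cluster of cheap storage nodes (illustrative only)."""

    def __init__(self, node_names, replicas=3, seed=0):
        self.replicas = replicas                        # copies kept per chunk (assumed)
        self.nodes = {name: {} for name in node_names}  # node -> {chunk_id: bytes}
        self.rng = random.Random(seed)
        self.next_chunk = 0

    def append(self, data):
        """Store a new immutable chunk on `replicas` distinct nodes; never overwrite."""
        chunk_id = self.next_chunk
        self.next_chunk += 1
        for name in self.rng.sample(sorted(self.nodes), self.replicas):
            self.nodes[name][chunk_id] = data
        return chunk_id

    def holders(self, chunk_id):
        """Names of the live nodes that currently hold a copy of the chunk."""
        return [name for name, chunks in self.nodes.items() if chunk_id in chunks]

    def read(self, chunk_id):
        """Read from any surviving replica."""
        holders = self.holders(chunk_id)
        if not holders:
            raise KeyError(f"chunk {chunk_id} has no surviving replica")
        return self.nodes[holders[0]][chunk_id]

    def fail_node(self, name):
        """Simulate a node failure, then re-replicate every chunk it held."""
        lost = self.nodes.pop(name)
        for chunk_id in lost:
            self._heal(chunk_id)

    def _heal(self, chunk_id):
        """Copy the chunk from a survivor to spare nodes until the replica count is restored."""
        holders = self.holders(chunk_id)
        if not holders:
            return  # every copy was lost; nothing left to copy from
        data = self.nodes[holders[0]][chunk_id]
        spare = [name for name in sorted(self.nodes) if name not in holders]
        missing = max(self.replicas - len(holders), 0)
        for name in spare[:missing]:
            self.nodes[name][chunk_id] = data
```

A caller would create a cluster of, say, five hypothetical nodes, `append` some data, call `fail_node` on one of the chunk's holders, and then find that `read` still succeeds and the replica count has been restored without any backup or restore step. That is the sense in which such a system is "self-healing."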
The other concept assisting web scale storage is the result of clustering, a popular server concept that is being extended to storage subsystems. Clustered storage is a networked storage system that allows users to add compute nodes, all of which access the same pool of data. Arrays work together as an intelligent team, capable of running on their own and communicating with other arrays to deliver data in response to user needs. Clustering provides massive throughput because of the increased connectivity that comes from cobbling many storage servers together into a single pool of disks and processors, all working on a similar task and all able to share the same data. Clustered storage falls into two categories: systems that combine block-based data on a storage-area network (SAN), and those that create a common file name space across NAS filers. NAS clusters are the most common and have been available from NAS vendors for years. With clustered storage, essential storage management functions are distributed across the storage server farm. Storage capacity can be added without disrupting applications running on the cluster.
Action item: Organizations requiring clustered storage today should be prepared to build it themselves or experiment with upstarts. Storage clustering, a predecessor to grid storage, has generated increased industry discussion lately as more businesses need web scale storage, but few market leaders have fully embraced the concept. Nonetheless, clustered storage brings IT closer to grid storage, something that is still several years away.
Eighty-five percent of the data on storage spinning in the data center is unstructured. Most IT organizations have storage infrastructures and processes that reflect the optimum way of providing good response times, throughput, availability and business continuance for tier 1 transactional systems. Applying those practices to unstructured systems is overkill.
Storage in high-performance arrays (such as EMC DMX, Hitachi USP, IBM 8300) costs ~$20 per GB. Modular storage costs $10-$15 per GB. New entrants such as 3PAR have slightly lower costs but use the same model of proprietary hardware and software. The raw cost of storage is less than $1 per GB. Clustered storage arrays, based on commodity storage and system components, can have much lower price points than high-performance arrays, as Google, Amazon and others have demonstrated. These clustered storage systems will have different functionality than traditional arrays that will facilitate the execution and operation of Web 3.0 applications.
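To make the price gap concrete, the per-GB figures quoted above can be applied to a single capacity. The sketch below is only arithmetic on those figures: it treats $1 per GB as an upper bound for raw commodity storage (the text says "less than $1"), the tier labels are invented for this example, and the capacity passed in is whatever hypothetical size the reader wants to test.

```python
# Per-GB prices taken from the figures quoted in the text (US dollars).
PRICE_PER_GB = {
    "high-performance array": 20.0,
    "modular storage (low end)": 10.0,
    "modular storage (high end)": 15.0,
    "raw commodity disk (upper bound)": 1.0,  # text says "less than $1 per GB"
}


def capacity_cost(capacity_gb, price_per_gb):
    """Purchase cost in dollars for the given capacity at a flat per-GB price."""
    return capacity_gb * price_per_gb


def compare(capacity_gb):
    """Map each tier to (cost in dollars, multiple of the raw commodity cost)."""
    baseline = capacity_cost(capacity_gb,
                             PRICE_PER_GB["raw commodity disk (upper bound)"])
    table = {}
    for tier, price in PRICE_PER_GB.items():
        cost = capacity_cost(capacity_gb, price)
        table[tier] = (cost, cost / baseline)
    return table
```

Calling `compare(1_000_000)` for a hypothetical petabyte (using decimal units, 1 PB = 1,000,000 GB) gives $20 million for a high-performance array against at most $1 million in raw disk. A clustered system built on commodity parts would land somewhere above that raw figure once servers, networking and software are included.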
One of the biggest challenges to IT is developing a strategy for managing unstructured data. This data represents both a potential liability from litigants and a source of value to the organization.
One thrust for managing unstructured data has been to try to develop classification taxonomies, either by user input, by automatic classification inference methods, or from available metadata. Another thrust has been to remove duplicate data to lower the cost of storage. The success to date of these approaches has been mixed.
Another approach that may have broader applicability is to simply use very low cost clustered storage systems and use search and classification based on how data is accessed. This is a much simpler way to extract value from the data and much easier to implement. For users it would be a natural extension of their web experience.
Pooling data and creating an ecosystem with partners, suppliers and customers could further enhance the value of unstructured data. Combining this with data from the Web would lead to natural extensions of these search applications and the development of new ones, dynamically adapting to how users exploit the data to drive business value rather than using a preset classification scheme.
Action item: IT organizations should continue to invest in traditional arrays for transactional systems. They should also continue to use simple archive systems to reduce the major risks that may exist in unstructured data. However, IT organizations should minimize their investments in the exploitation of unstructured data and make no major changes in unstructured data management until distributed clustered storage networks become generally available for use in 2008 and onwards.