Introduction
Last month at Amazon’s AWS re:Invent user conference (see full coverage here), 9,000 attendees gathered to dig deep into the top cloud computing offering in the marketplace. Amazon Web Services (AWS) is very secretive about its business; nothing from revenue to the underlying infrastructure of AWS is disclosed since, as AWS SVP Andy Jassy told analysts attending the conference, “customers don’t care about this.” So while no one is allowed to tour the AWS data centers, we were given some rare insights into the methodologies and philosophy of design by AWS VP & Distinguished Engineer James Hamilton (see his interview on theCUBE from re:Invent).
Every day, AWS adds enough new server capacity to support all of Amazon’s global infrastructure when it was a $7B annual revenue enterprise.
Where specialization meets scale
Scale matters, but AWS is not an undifferentiated collection of commodity gear. Hamilton said that 10 years ago he believed that architecture should be a giant collection of commodity gear where software provides most of the value. He now believes that this thinking is wrong and that it is through hyper-specialization that Amazon can continue to deliver innovation. AWS S3 holds trillions of objects and serves over 1.5M requests per second. Massive scale is not enough on its own; performance also has to be predictable, and Amazon DynamoDB consistently delivers 3ms average latency across all APIs. When asked about Facebook’s methodology discussed at the Open Compute Summit earlier this year, which is to standardize on five compute configurations, Hamilton said “I have many more configurations than that now and will have even more next year.” Adding configurations adds overhead, but at AWS scale, Hamilton finds that a broader set is better than just having a handful of configurations.
While this message runs counter to the common argument that large public clouds save money through homogeneous deployments that reduce operational costs, Hamilton points out that AWS is not a typical data center:
- General market offerings must work in a wide range of data center environments; AWS solutions are optimized for specific, well known data center parameters.
- While many solutions are built for specific application requirements, AWS builds each application to a scale that is unmatched and therefore doesn’t lose economies of scale.
- Amazon designs and integrates the entire solution: hardware, software, and data center.
Compute is our density
Amazon is a large consumer of Open Source Software (OSS) but is not a public contributor. James Hamilton is himself a strong proponent of OSS initiatives, and in his presentation at re:Invent, he discussed the advantages of using commodity hardware. For the compute layer, Hamilton said that while a rack of Quanta servers weighs ¾ ton (up to 600 disk drives in a 42U rack – this matches the densest commercially available architectures for Hadoop), Amazon’s configurations are even denser at over 1 ton per rack! AWS has also added a number of flash-optimized instances. Recent industry figures show that ODM servers like those used at Amazon make up a sizable portion of the marketplace ($783M in 3Q13, which was 45% y/y growth). Amazon is not content to simply take components off the shelf; Hamilton stated that it has two engineers working solely on server power supplies, where redesigns that are pennies cheaper or a fraction more efficient translate into huge savings.
Hyperscale storage paradigm
While S3 may be the largest storage array in the world, it is built entirely from compute-resident disk and flash. Server-based storage architectures can also be designed for the enterprise. Large cloud providers have used “Distributed DAS” architectures for many years, and they have been adding more features to these solutions. Service providers and enterprise accounts need scalable solutions (although not at the same order of magnitude as Googlezon) that are more feature rich and don’t require a team of PhDs. ScaleIO (acquired by EMC) fits into this new category of solution; here’s a blog about a 1000 node configuration. The best fits for compute-based storage solutions are test environments and larger scale configurations. As discussed in VMware VSAN vs the Simplicity of Hyperconvergence, the overhead of building, testing, optimizing and supporting this sort of architecture makes the total cost higher for smaller configurations.
Networking becomes just another programmable component
From a networking perspective, Hamilton shared that AWS uses custom routers and protocol stacks. He is publicly supportive of white-box networking solutions and even wrote a blog post about Cumulus Networks bringing Linux to the networking world. By using merchant silicon, networking can follow a path similar to Moore’s Law, leading to lower costs and, especially at scale, non-linear growth (networking is one of the few resources that does not typically get cheaper at larger volumes). Since Amazon builds its own devices and stack, it can make fixes in a day that would otherwise take months if it had to wait for a vendor to spin code. Amazon’s network and every service are heavily monitored so that every metric can be tracked.
Utilization Gap
When building any infrastructure, you pay for the peak but only monetize the average. In a typical data center, even with a heavily virtualized environment, getting 30% utilization is great. Cloud methodology is to combine non-correlated workloads over infrastructure at scale so that the law of large numbers allows the difference between the peak and average workloads to shrink. Amazon’s low margin “cycle of innovation” is to iterate on this path:
- Innovate,
- Listen to customers,
- Drive down costs & improve processes,
- Pass on value to customers & re-invest in features.
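The peak-versus-average argument in this section can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of AWS: the workloads are taken to be fully independent, and the burst profile (1 unit of demand normally, 10 units during a burst, bursting 10% of the time) is a made-up example. The function name `peak_to_average` and all of its parameters are hypothetical.

```python
import random


def peak_to_average(num_workloads, num_samples=1000, seed=0):
    """Return peak / average of the combined demand of independent workloads.

    Assumed (hypothetical) profile: each workload needs 1 unit of capacity
    most of the time and 10 units during a burst, and bursts independently
    in 10% of the sampled time slots.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    totals = []
    for _ in range(num_samples):
        # Combined demand of all workloads in one time slot.
        total = sum(
            10.0 if rng.random() < 0.1 else 1.0
            for _ in range(num_workloads)
        )
        totals.append(total)
    average = sum(totals) / len(totals)
    return max(totals) / average


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        ratio = peak_to_average(n)
        # Utilization if capacity is sized for the observed peak.
        print(f"{n:>5} workloads: peak/average = {ratio:.2f}, "
              f"utilization at peak sizing = {1 / ratio:.0%}")
```

With a single workload, capacity sized for the peak sits mostly idle; as more uncorrelated workloads share the same infrastructure, the combined peak moves closer to the combined average and utilization rises. Correlated workloads (for example, everyone peaking on the same shopping day) would not smooth out this way, which is why the non-correlated assumption matters.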
Action Item: Scale matters, and not all clouds are created equal. Amazon Web Services is continuing to innovate in secrecy in an attempt to keep ahead of contenders to its cloud leadership. CIOs need to pay attention to the hyperscale players, which herald the direction of technology. The forecast of public versus private cloud usage over the next few years is hotly debated, but there is no doubt that infrastructure designs and operational models are seeing seismic shifts, and Amazon is a key disruptor.
Footnotes: Stu Miniman is a Principal Research Contributor for Wikibon. He focuses on networking, virtualization, converged infrastructure and cloud technologies. Stu can be reached via email (stu@wikibon.org) or Twitter (@stu).