Jay Rossiter, SVP, Cloud Platform Group, Yahoo
- Attendance >1600 ppl, from >400 companies.
- No longer a West Coast early adopter phenomenon.
- Hadoop is almost mainstream: not yet at enterprise-level purchasing, but getting close.
- “Don’t get much more mainstream than Jeopardy”; Hadoop was a major engine behind this.
- Big data the hottest topic in data today, aggressive commercialization ongoing.
- Massive amounts of data sitting on the floor, opportunity to extract value and differentiate business.
- Seeing a number of major companies getting into the game…cited EMC, IBM and Informatica.
- Rapid ecosystem expansion happening…notable differentiation, companies specializing in different parts of infrastructure, tools and applications.
- Players are making markets out of niches.
- Dapper (acquired by Yahoo) does smart ads, uses Hadoop to build models to personalize ad creatives.
- Yahoo retiled maps of U.S. using Hadoop in 5 days, 800% improvement over prior cycle.
Hadoop @ Yahoo
- Hadoop key piece of architecture that runs company.
- “Helps make them the premier digital media company.”
- One of the largest private clouds in the world, >200 PB under management.
- Hundreds of web properties across the world.
- Demands placed on Hadoop vary across Yahoo.
- Hadoop is what makes Yahoo personal and relevant.
- Use CORE (Content Optimization and Relevance Engine) running on top of Hadoop for the home page, to make it more personalized and increase click-through rate.
- Use Hadoop for anti-spam.
- Use Hadoop in search, to link structured data that is relevant to user search queries.
- Use Hadoop for local geo-tagging and location relevance.
- Display…not just ad targeting but to predict marketplace demand, understand supply and improve efficiency of display marketplace.
- Can link different types of data…link news articles to videos and images and create much more engaging experience, from head to tail.
- Mobile devices…determine content relevant for various form factors, also better understanding of content intent.
Hadoop development started 5 years ago:
- 42k servers.
- 200+ PBs storage.
- 5m+ monthly jobs.
- “Behind every click” (mid-2009): capture the data behind every click and make it available to applications and for classic data analysis.
- Now focused on knowledge as a service, “the secret sauce”: take all Yahoo content and integrate it with the rest of Yahoo tech.
Yahoo architecture:
- Take classic cloud PaaS (IaaS, compute and storage) and use that to drive the “knowledge as a service” layer.
- Run Hive and HBase at Yahoo, created by other companies.
- Deeply committed to open source.
Hortonworks:
- Dedicated to the adoption and maturation of Hadoop, Apache and open source.
- Will continue to develop around and contribute to this project.
- Yahoo is an investor and partner and major adopter of the technology.
- Will co-develop the next generation of technology with Hortonworks and harden it to turn it into an enterprise-class offering.
Hortonworks, Eric Baldeschwieler, CEO:
- Focused on revolutionizing and commoditizing the storage and processing of big data via open source.
- Believe 50% of world's data will be stored in Apache Hadoop in next 5 years.
- Strategy is to grow and enable the Apache Hadoop ecosystem, holding nothing back, will be a complete product offering.
- Highest concentration of Apache Hadoop committers; contributed >70% of Hadoop code; delivered every major stable release since the beginning.
- Business operations led by Rob Bearden, former COO of SpringSource and JBoss.
- Brings team of open source execs.
- Backed by Benchmark and Yahoo!
- Yahoo is a development partner.
- Yahoo has >1000 active users of Apache Hadoop.
- Yahoo will continue to contribute.
- Yahoo is a customer, Hortonworks will provide level 3 support and training to Yahoo.
- Hortonworks taking core architects and contributors.
- “When it gets really interesting they can call us”.
- Yahoo is an investor.
- Release engineering and collaboration will continue through Apache.
Current state of adoption:
- Huge amount of interest in Hadoop, early adopters using it everywhere.
- Major service provider…“seeing Hadoop in all major Fortune 2000 accounts”.
- Not easy to manage, requires expensive talent or consultants.
- Not a lot of 3rd party support.
- Knowledge and tech gap.
- Hadoop still very young, lot of ways it could be better…lot of things keeping it from popping.
- Committed to make Hadoop easier to install, manage, and use.
- Needs to be more robust, high performance, and available.
- Needs to be easier to integrate and extend, will focus on opening up APIs to enable experimentation and integration.
- “Anyone should be able to easily deploy the Hadoop projects directly from Apache”.
- Will do development in Apache, complete transparency, all code goes back to Apache, “no ifs, ands or buts”.
- Apache Hadoop has potential to be game changing but to do this need to come together as a community.
- Already in the middle of Phase 1.
- Make Apache Hadoop accessible.
- Release most stable version of Hadoop ever.
- Frequent sustaining releases off of the stable branches.
- Phase 2 – next generation Apache Hadoop (2012, alphas starting Oct 2011).
- Address product gaps: HBase support; HA (eliminate all single points of failure, can make huge progress on this); management (more sophisticated management tools).
- Enable community and partner innovation via modular architecture and open APIs.
- Work with community to define integrated stacks…want to test versions of all Hadoop tools together and release them.
- Next Gen.
- Core.
- HDFS federation…allows HDFS to support much larger clusters; new APIs to enable experimentation, such as supporting HBase on HDFS.
- Next-gen MapReduce.
- New write pipeline (HBase support).
- HA (no single point of failure) and wire compatibility…easier for clusters to interoperate.
- Data.
- Pig, Hive, MapReduce and Streaming as clients.
- Creating set of APIs to allow all clients to operate on same data on Hadoop.
- HDFS and HBase as storage systems.
- Integrate HBase as a backing store; write once and not worry about where the data is stored.
- Performance and storage improvements.
- Management & ease-of-use.
- All components tested and deployed as stack.
- Stack install and centralized config mgmt.
- REST and GUI for user tasks.
Q&A
- Will sell support for Apache Hadoop, not going to sell a version.
- Want to do this in partnership.
- Product much more about indirect offerings of training and services.
- Lot less conflict than anyone has ever characterized it as; the Hortonworks team was the driving force behind all the stable releases.
- Conflict is mostly a misunderstanding.
- Yahoo has and will continue to find and fix a lot of bugs in Hadoop before anyone else sees them.
- Things Yahoo wants are things the whole community is going to want…Yahoo just finds them first.
- Yahoo will contribute directly to Hadoop.
- Cloudera.
- “There is room for a lot of people to proceed.”