This is a near-transcription of an October briefing by Hortonworks CEO Eric Baldeschwieler, given at the time of the company's announcement of a partnership with Microsoft. In it, he discusses the philosophy and background of Hortonworks and its core team, its vision of partnering with multiple vendors to help them build on Open Source Apache Hadoop, and his view of the future of Hadoop and traditional structured databases, and how the two will work together. Attending the briefing were:
- Eric Baldeschwieler
- Jeff Kelly, Wikibon
- David Vellante, Wikibon
- Bert Latamore, Wikibon
EB: We're really committed to building out Apache Hadoop and doing it in the Open Source community. What really differentiates us, besides shipping 100% pure Apache Hadoop code, which nobody else does, is taking a very partnering, ecosystem-centric approach. So it's our belief that for Hadoop to be everything we think it can be, and we think it will be huge – we think that half the world's data will be on Hadoop in five years – we need to grow the ecosystem. To do that we think we need to make Hadoop easier to use, we need to provide training & support, & we need to provide the bridges between Hadoop & as many other ecosystems as we can.
So today's relationship with Microsoft is an example. We are announcing a development partnership with Microsoft to work with their engineers to develop changes to Apache Hadoop that will make Hadoop better in general and also make Apache Hadoop run much more effectively on Windows. That's obviously Microsoft's focus, and what we bring to that is the ability to help them with the Apache Hadoop process.
Before the founding of the company, I was running the Hadoop team at Yahoo. I had done that since the start of 2006. Before that I worked on Web search for Yahoo. Before that I was developing software at Inktomi, which was a 2003 acquisition. So the majority of my work in Web search was building big data systems for processing all the pages on the Internet to build Yahoo search. Back in 2006 I was the chief architect of Web search, and we decided to embrace the Hadoop initiative to help us build Web-search platforms. We quickly discovered that the science teams in Yahoo could make much wider use of Hadoop. We saw options for using it in e-mail spam detection, in advertising, in home-page customization, etc. Today Hadoop is used widely across Yahoo. They run a service with more than 1,000 active internal users, running on over 40,000 machines. There's a tremendous amount of science work to build product and production work to inform the results you get from Yahoo every day.
This year we decided to take an investment from Benchmark Capital and an investment from Yahoo to found Hortonworks. We took 22 people from Yahoo including myself, the core committers & architects who helped us take Hadoop from the prototype it was in 2006 to what it is today. So we have a strong core team of engineers. We set out to grow the Hadoop community & to develop training & support as a foundation for Hortonworks.
JK: You said you are the only organization focused on shipping 100% Apache Hadoop code. You don't consider others?
EB: We're the only ones committed to shipping Apache Hadoop code. We've been the drivers behind every major release of Apache Hadoop since its inception. Other companies are packaging and distributing Hadoop, but when they do that they add lots of their own custom stuff, both as patches to the Apache Hadoop distribution and also as independent products. A lot of that work is going into Apache, and since we committed to the Open Source model we've seen a lot more third party code going into Apache, which is obviously a win for the community. But to date no other company is actually taking releases from Apache & supporting them. They create their own versions that are slightly different from what comes from Apache, and try to build a business around that.
JK: We think from talking to some of your upcoming partners that we have a handle on this, but we'd like to hear from you directly: How else do you differentiate in the marketplace?
EB: As I discussed, we're completely committed to an Open-Source business model, while the other companies are committed to selling proprietary software. Beyond that we have unparalleled deep domain expertise in Hadoop. That gives us an opportunity to form a unique set of partnerships and allows us to focus on growing the community. So we're a very long play. We're not going to fight for the small number of production Hadoop accounts out there today. We're interested in growing the pie so that instead of the dozens of people using Hadoop today, we'll grow that out to tens of thousands of people in a few years. We're focused on growing the ecosystem via partnerships. So the Microsoft partnership is an example of what we're looking to do – work with people who can help us take Hadoop to their customer bases. We have a number of other relationships we hope to announce soon. As we do that, you will see that we are talking to people across the spectrum – platform providers like Microsoft, OEM hardware providers, Cloud providers, ISVs, folks who build software that integrates with Hadoop, systems integrators, etc.
JK: So your monetization strategy would be to commit to support and training and other services as opposed to licensing what you're calling proprietary software?
EB: That is our strategy for now. Down the road we may build other software offerings with another licensing model, but we are committed to always ship Hadoop for free. The Apache Hadoop platform should be complete and free. We are not shipping parts of it for free but then saying if you want to go into production you will need these other pieces that are proprietary. We want people to be confident that they can build their whole application for free, and if they want to engage us for training and support, we'll be available.
DV: The feedback we've had from potential partners is that they really like this message. What you're saying to us is really clean & clear. Some of the others, Cloudera in particular, are very unclear to us. They seem to be going in different directions. Maybe that's because they're growing so fast, or maybe because they now have competition, we're not sure. But it's causing confusion in the ecosystem. I know you know this; you're closer to it than we are.
EB: We want to do everything we can to un-confuse the ecosystem. Fragmentation in the market is our biggest concern, which is why we are encouraging everyone to commit to Apache Hadoop as the standard.
JK: How does that work with partners who want to inject their own proprietary tools as potential add-ons to your distribution? For instance, we were talking to Informatica earlier. They are coming out with a tool shortly that they are keeping closed source.
EB: It's pretty clean. Lots of people sell applications on top of Linux as well. We believe that Hadoop, the platform, should be free. And to grow the ecosystem we want as many vendors to come in with as many solutions as possible. Closed source, open source, we expect a variety of both. We're focused on evolving Hadoop so it is extensible, so if people want to bring value-add differentiation, it can be done well using the Open Source foundation. So once Microsoft has Hadoop on Azure, Azure is a very differentiated service. But we want them to use the same Hadoop everyone else is.
JK: What are some of your other go-to-market strategies, particularly to reach more traditional enterprises?
EB: All the obvious marketing channels. We want to provide all the training & education in Hadoop that we can, and we intend to build up strong offerings there. Also as I said, we think partnering is key. We want to work with people who already provide services to the enterprise customer base & come in with a Hadoop offering with them. So rather than build out a large SI practice, our vision is to work with people who already have large groups of enterprise customers & help them with their Hadoop problems with our solution.
JK: How would you characterize the state of the Apache Hadoop distribution as it stands now?
EB: Hadoop is certainly still young software, and many things can be done. I will happily talk about how Hadoop will be better later, and it will be better later, but lots of folks in the market are trying to differentiate by spreading FUD about Hadoop. Hadoop is ready for a set of enterprise applications. Yahoo, Facebook, eBay, etc., lots of Internet companies, are betting their businesses on it, and lots of real dollars are flowing through their Hadoop installations. There have been successful installations at banks & transaction processing centers at "real" enterprises as well, not just Internet companies. So there's a diversity of successes. The challenge is it is not a panacea. You need to understand what problem you are going to solve with it & be sure it is a fit. Then build up the right organizational competencies to use the technology. And today I would say you need training and probably need to work with an SI partner to build the Hadoop solution. So it's not trivial to use, but our goal is to make it much easier to use over time.
JK: So related to that, there's obviously a shortage of data scientists and Hadoop engineers in the market to work with Hadoop, & it can be a complex system. Are you engaging the community to build up training, working with educational institutions to increase the skill set of engineers?
EB: We do have plans in that direction but have nothing to announce today other than that we are building a training curriculum that we will offer directly & through partners.
JK: What is your take on the Greenplum proprietary model? Why is your model better?
EB: It is open source. And it's more mature technology. Hadoop has been used on very large scale, solving problems for a number of years. The same can't be said for competitive technologies. The companies that are in competition are only a couple of years old. Hadoop has been in production for four years at Yahoo and in a number of other companies for as long or longer. Lots of folks are trying to FUD Hadoop, and because it is an Open Source product you can go and read the code for it and identify things to talk about. But any product that is 2 years younger than Hadoop will have more challenges, not fewer. Which isn't to say that other people might not develop great technology, but they have a lot to prove. A lot of people have announced the intent to build competitive systems, and a lot of those systems are either still in small niches or not relevant to the Hadoop discussion today. Hadoop has gone from strength to strength for six years now. So only time will tell.
JK: I heard you say that you believe that within 5 years half the world's data will be on Hadoop. What's your estimate of the percent of the world's data that is on Hadoop today?
EB: That's a good question for which I don't have a good answer. There are many petabytes of data on it. We should get back to you on that one rather than having me speculate. I think it's a small fraction of the world's data, but the thing that's exciting is the rate at which unstructured data is growing. So we don't think Hadoop will replace traditional systems; we believe there will be more traditional data systems in five years than there are today. But we think that the growth in non-traditional data processing will far exceed the growth of those systems just because enterprises today are dropping a huge percentage of the data they generate on the floor. The rate at which they're generating data is accelerating, & their desire to retain it for longer is accelerating. So Hadoop comes in with a much better price point for retaining data & the ability to do much more with it. So for example several companies are taking their offline archives back online with Hadoop, so they are retaining much more data than they used to. So it isn't that we think Hadoop will replace existing systems. Rather we think there will be an explosion of unstructured data, semi-structured data, and longer-retention data that's not in the current production systems.
DV: And won't necessarily end up there – the Oracle scenario that we heard at Oracle OpenWorld.
EB: I think a lot of data will be processed in Hadoop. But we do see a strategy that makes sense, which is to have the fine-grained data in Hadoop and then process extracts in a datamart or a cube. We're seeing a number of vendors doing that, & that's a very viable solution. You get more out of that investment because you can select different views & load your data cube with them. So it gives you a two-tiered solution. So there's undoubtedly a market for what Oracle is doing.
JK: So it sounds like you see traditional data warehousing as complementary to what Hadoop is doing. Are you running the risk of relegating Hadoop to a storage layer without much value beyond that?
EB: I see tremendous value in Hadoop, but I also see it as a kind of connector between the existing data silos today. We could send you a presentation I did a little while ago to some analysts where we talked a little about this.
I think the way Hadoop comes into the enterprise is as a sort of connector that lets you pull data from a silo where it is not useful & get value from that data. Then as your use evolves you start to pull data from multiple places and combine it together in new and unique ways. And you push those results back out into different parts of the organization as well. So Hadoop winds up in the middle. But as you have more data in the center on Hadoop, it creates the potential for more Hadoop-centric applications, and over time I think we'll see more partners come to market with Hadoop-centric applications. So I think it starts in the middle & grows. More mainframes are sold today than at the start of the client/server revolution. So I think it's possible for Hadoop to not displace traditional data systems & still be revolutionary & significant.
JK: So when you say Hadoop-centric applications are a possibility, you see that growing in the future, you mean that applications will live inside the Hadoop environment that are currently not possible or are presently being off-loaded from Hadoop into more traditional data warehousing environments?
EB: Yes, today there are a number of pure-play Hadoop offerings. IBM has BigInsights. A wave of companies are building offerings directly on Hadoop. As Hadoop becomes more common as the connector between data silos, you will see more offerings on Hadoop. I see that all as complementary to organizations' existing data infrastructure.
JK: I want to touch on the back-and-forth you had with Cloudera about who's contributed the most & how you define that. Why is it important to hold that title?
EB: We are offering a service to partners, which is to help them understand & integrate their products with Open Source Apache Hadoop. So it's important to us that people understand that we have led every Hadoop release & implemented every major feature added to Hadoop since its inception. That's our differentiation. So we'd like to tell that story. We believe that anybody who actually inspects the source code repository at Apache will understand our conclusion. So it's important for us to get our own message into the market. As I said, we're focused on growing the pie. We've worked with Cloudera since their inception to help them grow their Hadoop business & are happy to continue doing that. We hope to define a partnership, a broad partnership eventually, and in the meantime there are lots of places where we are working together to enhance the Hadoop ecosystem. I think the dispute has perhaps been overblown.
JK: Could you talk about the milestones we should be looking for going forward as observers & the timeframes for those in whatever detail you are comfortable with?
EB: This year our focus is putting out v0.20.205, which we think is the best release of Apache Hadoop ever & the first one with native support for HBase. Previously you had to build a special version of Apache Hadoop or use a non-Apache build to use HBase with Hadoop. So we have just released 0.20.205, which we think is a very important release of Apache Hadoop.
Beyond that we're focused on building our training & support offerings this year. We anticipate announcing new partnerships in the coming months & having alpha & beta training courses this year. We are also going to incrementally improve some of the core Apache projects that we have focused on, including ??, an installation and management suite that in the short term focuses on making it easier to install Hadoop. And we are focusing on HCatalog, which is another Apache project that makes the Hive table model available to ??. It's proving a very interesting project for companies looking to integrate with Hadoop. It presents a more traditional model that makes it easier for them to integrate their tools with Hadoop. So you should look to see improvements to HCatalog and ?? and the release of our training and support offerings.
Then we're very focused on 0.23.0, which is the next release of Hadoop. We anticipate an Apache alpha release this year, which is not something anyone would install in an enterprise but will let us test & refine that release. We anticipate it will reach beta quality at the end of 1Q2012, so people can start using it at that point for real production applications.