Ed. note: This is a transcription of [http://www.siliconangle.tv/video/kevin-weil-twitter-analytics-lead-hadoop-world an interview with Kevin Weil], analytics lead for Twitter, at Hadoop World 2010, conducted by Wikibon.com Co-Founder David Vellante and SiliconAngle Founder John Furrier and webcast on SiliconAngle.tv. Because this is a transcript and not intended as an article, speakers are identified by their initials.
JF: So you gave a talk, and I heard it was pretty well attended. How did it go?
KW: It went well. It’s a short format here; they have so many people, and they are doing it all in one day. But it is amazing to see the turnout.
DV: So we were broadcasting live here during your talk and didn’t have the opportunity to hear it. Can you give us the bumper sticker version?
KW: Sure. The basic premise was everyone’s using Hadoop, and the fact that 1,000 people are here today shows how powerful Hadoop can be. But the ecosystem around Hadoop is where we still need innovation. So how do people get data into Hadoop, how do they store data in Hadoop, and how do they integrate Hadoop into their larger ecosystem & workflow? Generally you have an online site you are managing as well, and Hadoop is just part of what you do. So what I was talking about was some of the tools we use that are open source, or that we have developed and open sourced, how those fit into our Hadoop cluster, why those are useful, & how other people can use them.
DV: So those tools you develop yourself you put back into the Open Source community?
KW: At Twitter we try to do that as much as we can. The company believes in that, & all the engineers believe in that, which makes it a great place to work.
JF: You guys do keep some in house?
KW: Some stuff we can, yes.
JF: Two big things we are hearing here about Hadoop that are really helpful are: 1. the notion of analytics getting answers to questions that previously were impossible to answer, and 2. real-time data. Those seem to be the two common themes we’re hearing. How do you view Hadoop from that perspective?
KW: Most of our uses for Hadoop are big-data and batch oriented; they’re not real time. Hadoop itself wasn’t built to do real time, so it doesn’t specialize in real-time data. If you’re okay with taking three or four minutes to run a job, Hadoop is amazing. But as far as a real-time system, you have HBase, which was built on top of Hadoop to be the real-time Bigtable clone. We’re starting to use that a little bit, but by and large our real-time stuff is done outside of Hadoop. So when you build a timeline on Twitter, and I show you the most recent tweets from you and you, that’s not done in Hadoop but in MySQL or other systems that are built to be low latency.
JF: When you are doing a lot of analysis, three minutes is not a long time to wait. It’s a great response time for heavy lifting on data.
KW: Exactly.
JF: How much data do you guys use in your Hadoop bucket?
KW: The number I can share is we have 12 Tbytes of data coming in every day. So Hadoop is a big part of our ecosystem.
JF: What tools did you mention in your talk that you are using?
KW: We actually use a pretty broad range of tools. We use Scribe, which was open sourced by Facebook, to get data into Hadoop. We open sourced a tool called Elephant Bird, which we use to store data and read data in and out, and then we are heavy users of Pig and are getting into HBase more and even a little bit of Hive. So we’re across the spectrum, & we’re also fortunate to employ committers on all of those projects. We have a Hadoop committer, we have a Hive committer, we have a Pig committer, which means we can contribute back effectively to all of the projects we use.
DV: So if I wanted to measure whether there is a positive sentiment around my brand, would I do that with some kind of Hadoop platform, or would I use some of the real-time stuff you mentioned? Is that even a legitimate question to be asking today, or is that in the future?
KW: Sentiment work is challenging in general. Everyone would admit that. It certainly is a case where Hadoop would be applicable. It’s also something that we’ve seen. Twitter has an API and a platform, and we have a lot of third-party developers out using our API. One of the big things we’ve seen crop up around the API is third-party developers building semantic analysis platforms and tools to work with brands.
DV: So you might go in and see if a brand is popular. So Hadoop may or may not play in that area?
KW: I would imagine a bunch of those companies use Hadoop. Maybe not all of them, it depends on what they’re trying to do, how big the data is.
JF: Kevin, what’s your background? And talk about the kind of people Hadoop’s attracting. It would seem to be a class A entrepreneur, engineer type, but not all super geeks. A lot of statistical guys, a lot of math, a lot of computer science, some AI. It seems to be a cross-section of disciplines. It’s not the normal network guys. It seems to be a new breed of engineer/scientist.
KW: My background’s actually in science. I was in physics & math before I dropped out of a PhD in physics to join the startup world.
JF: Like Mike Olson.
KW: There’s a whole subculture of PhD dropouts. I think a lot of people started like me. I was working for Tropos Networks analyzing mesh network data. So we had Gbytes of data, & I was writing Perl scripts to analyze it. And when I moved to my next company, we suddenly had Tbytes of data and those Perl-based tools were not working any more. So I think the reason you are seeing a lot of statisticians and mathematicians get involved is that they have worked on smaller data sets and know how to do that. Suddenly they are faced with larger & larger data sets, & Hadoop is one of the best tools out there to handle them.
JF: So anecdotally are you seeing any kind of pattern in the kind of people?
KW: I think it starts with analytics, people who are in the data space. But then as Hadoop gets more broadly used, as Cloudera becomes more and more successful at making Hadoop easier to use, because they are doing a great job at taking away a lot of the hard edges, making it easier for developers to pick up and play with, you will see more & more software engineers get involved. We actually have product marketing people at Twitter who use Hadoop every day, people who can hardly write code, whom we taught to use the clusters.
JF: Did you do a front end for it?
KW: They actually are at the terminal. They aren’t coding, but they are running queries, which is awesome to see.
JF: You guys are growing so fast you have product marketing people in terminal mode. How good is that?
KW: They’re up to their elbows.
JF: So we were just talking to James Phillips, co-founder of Membase, which was NorthScale, and they are powering ??? and competing with Twitter for taking the most real estate in San Francisco. Zynga’s growing like crazy. He was talking about the scale of operations. You guys deal with that every day also. You were talking about going from Perl scripts to massive data pools. What’s it like, what mindset do people need to have, & what advice can you share with people? Because you guys are the leaders in terms of drinking from the fire hose. You’re operating at scale & growing & having to rearchitect, and doing all this stuff at massive scale. You can’t make a mistake.
KW: One of the things I try to talk about when I am up there is that there are numerous challenges to growing this fast, & one of them is serving the online site at low latency. That’s one challenge. I deal more with the analytics side of the challenges, which is that data is growing at immense rates, & you’ve got to get things out of it that you can then use to help the business, help the product. One of the things that Hadoop brings to that is its ability to be scale free. So if your data goes up by a factor of 10, as long as you can scale out your hardware, your code doesn’t need to change. So you don’t have to go back and rewrite & redo all the work that you’ve done. This is one of the first times that has been true. Suddenly you can literally throw hardware at the problem. It means that we can continue innovating on the analytics side rather than having to go back to redo what we did because we have 10 times more data than we used to.
JF: We sit up in the Cloudera office. That’s where our office & studio are, in Palo Alto. So we see the guys there all the time. One of the hallway conversations going on all the time among the supergeeks there is this notion of scaling out. Suddenly you have new issues. Scaling out is not a trivial thing. Do you have any opinions on the challenge of scaling out and points of light for people out there?
KW: The situation is much better than it used to be. If you had the same problem 10 years ago you would have been building all of your own stuff from scratch. Google built all of their own stuff from scratch. Now with the open source community growing there are more solutions out there. You’ve got people like Google who did it, and even if they didn’t open source a lot, they at least talked about how they did it. That led people to start open sourcing things, as long as you’ve got brilliant people around who will build things. Then Facebook has done a pretty good job of open sourcing some of the tools they built that helped them scale. As we’ve grown, we’ve tried to do the same & open source as much as we possibly can so companies that come behind us don’t have to deal with some of this stuff any more.
I think in the ideal world all of the back-end commonalities are open source and companies get to innovate in their particular domain, but they don’t have to reinvent the entire stack every time. That will make everybody work faster.