The proliferation of Open Source tools contributed by leading-edge data factories like Facebook, Google, and Twitter is easing the path for companies dealing with the Big Data explosion of the Internet, Kevin Weil, analytics lead for Twitter, said at HadoopWorld 2010.
“The basic premise was everyone’s using Hadoop, and the fact that 1,000 people are here today shows how powerful Hadoop can be,” he said in an interview with Wikibon.org Co-Founder David Vellante and SiliconAngle Founder John Furrier, webcast on SiliconAngle.tv. “But we need innovation in the ecosystem around Hadoop. How do people get data into Hadoop, how do they store data in Hadoop, and how do they integrate Hadoop into their larger ecosystem and workflow?”
A decade ago, when Google started, it had to invent all its own tools, including the database technology itself. Today things are getting much better, largely because of the open source community. Google has not contributed many of its tools to Open Source, but its designers do talk about how they solved problems, which gives other developers ideas about how to approach their own big data problems.
“Facebook has done a pretty good job of open sourcing some of the tools they built that helped them scale,” he said. Twitter has benefited from those tools and from tools released by other pioneers, and it has released many of the things it developed back into the open source community. With 12 Tbytes of data pouring in through the firehose daily, it needs all the help it can get.
“We use Scribe, which was open sourced by Facebook, to get data into Hadoop. We open sourced a tool called Elephant Bird, which we use to store data and read data in and out, and then we are heavy users of Pig and are getting into HBase more and even a little bit of Hive. So we’re across the spectrum.”
Twitter also has published an open API and platform and has an active third-party development program. “One of the big things we've seen crop up around the API has been semantic analysis platforms and tools,” he said. These can be used to recognize mentions of specific brand names in Tweets, which can be a first step toward analyzing Web content to gauge the public perception of those brands.
Open source cannot supply everything, however. Mr. Weil credits Cloudera with developing an excellent set of extensions and tools to make Hadoop easier to use, which in turn encourages increasing numbers of developers, software engineers, and even non-technical people to join the growing community of Hadoop users.
“We actually have product marketing people at Twitter who use Hadoop every day, people who can hardly write code, whom we taught to use the clusters. They aren't coding, but they do run queries directly on the terminals.
“I think in the ideal world all of the back-end commonalities are open source and companies get to innovate in their particular domain, but they don’t have to reinvent the entire stack every time. That will make everybody work faster.”