The following is a transcription of an interview between David Vellante (DV) and Professor Mike Stonebraker (MS) of MIT at the Mass Technology Leadership Council event on February 17, 2011, on the Siliconangle Cube. Their discussion focused on big data and was conducted earlier in February 2011 after a panel discussion.
DV: Hi. This is Dave Vellante of the Wikibon Project. We're here with Professor Mike Stonebraker of MIT, at the Mass TLC, big data, tsunami. Mike is one of the foremost database people in the world and just ran a panel discussion of database types discussing architecture, so I want to talk a little bit about that. But first of all, do you not like farming?
MS: That was the comment in the panel that Zynga is apparently the fastest growing IT company on the planet right now, and its leading game is Farmville. You do virtual farming over the Internet with your friends, and I have to admit I just don't quite get it.
DV: You do real, organic farming. So you said that data storage is actually not that hard. You know, Google has half a million servers. It's quite easy to store information; it's managing information that is difficult. Why is that?
MS: If all you want to do is store encyclopedia facts and retrieve those facts, or store 10 gazillion pictures and get them back by ID, that isn't difficult. But suppose you are the U.S. military, and you want to put a video cam on every light pole in Iraq. Then the problem is you want to spot specific cars – those that seem suspicious and may have bad people inside – and track those cars as they drive by successive intersections so that you can know where each one stops to plant a roadside bomb. That requires managing and interconnecting data, and that's what's hard. So just storing facts is not difficult, but doing data management and allowing sophisticated queries in huge data sets is hard.
DV: So you talked about the data tsunami coming from two places: Web 2.0 and scientific/research types of applications. Can you talk a little more about that and what you're seeing? Are we seeing a big sea change here, or are these just sort of incremental to what we know today in IT?
MS: I think it's a real sea change. I think the genomics applications are going to be overwhelming because sequencing a human genome is getting to be relatively cheap, and it's going to be socially advantageous to sequence essentially everyone on the planet. Then the kind of query that you want to run, for instance, is to take all those billions of individual genomes, correlate them with the diseases each individual suffers from, and then try to identify genomic sequences that might have made specific populations more susceptible to specific diseases. The social benefits of that kind of research are potentially enormous, so the big pharma companies are looking at this data tsunami coming straight at them.
Then the Web 2.0 companies are all trying to discover information about their consumers to improve their sales and streamline their operations. So Netflix wants to do a better job of predicting what movie their customers will like; eBay wants to do a better job of predicting what you will want to bid on when you log in so they can suggest the right auctions to you; etc. So they're all trying to do consumer modeling. This is something new and different and hasn't been done before. So I think this is a sea change in the kind of apps that people are focused on. And it's not really coming from traditional enterprise data warehouses. This is a green field of new stuff that presents a great opportunity.
DV: I inferred from your remarks that you may be a little bit of an Open Source skeptic, and especially a cloud skeptic. Is that unfair? And my follow-up question on that is: Is open source software potentially commoditizing software? And is data becoming a new leverage point of competitive advantage?
MS: Okay, first off, yes I am a slight cloud skeptic. I went to a conference about six months ago where the organizers asked a room full of about 250 people, "How many of you are willing to put production data on the public cloud?" And only a smattering of people were. The reasons the attendees gave for not putting their data on the cloud were: A. regulatory requirements; B. privacy requirements; C. security concerns, particularly in some industries that can't have their data outside the country. Some companies have a policy that simply doesn't allow them to put data on the Web. So I think it's going to be a while before huge amounts of mission-critical data go onto the public cloud.
On the other hand, everyone I know of is investing in private clouds that are inside the firewall, so I'm a huge believer in grid computing. The real question is, "Who's gonna run it?" Is it inside the firewall, or outside?
And then I'm a huge fan of Open Source, and I think it's wildly misunderstood by almost everyone. To me, Open Source is about the sales model. Standard enterprise software is sold by a four-legged sales team, which is a sales guy who's only smart enough to take the customer to lunch paired with an SC who answers technical questions. Then the sales cycle goes on for a year, so you have a very, very, very expensive sales model. Open source has a wildly cheaper sales model, which is "drive people to your website." If they are interested, let them download the code, let them do the selling themselves. Open Source vendors don't send sales teams out to companies, so it's just wildly cheaper. And anybody who puts system software into production will buy support for it, so you end up selling them an enterprise version anyway. So this is about customer acquisition and sales models, and getting downstream revenue from support and from providing enhanced features. So I think it's a fabulous sales model because the traditional sales model is so inefficient.
DV: So does that suggest that the point of competitive leverage in the technology business will be more about giving the software away to sell the support? We've seen Red Hat and Cloudera do that. Or do you see an even bigger wave around data becoming the competitive differentiator?
MS: I think the model Cloudera and Red Hat are using will be very successful. I see a huge market for figuring out how to enhance your data, because everybody has an unbelievable amount of it. And the question is, "How do you get information out of that data?" And that's very market specific and often very complicated, so I think there's a huge opportunity for vertical market specialists to figure out how to build vertical market data enhancement systems.
DV: We were out at Strata two weeks ago at a big data conference held by O'Reilly Media, and Tim O'Reilly came on the Cube to say that the world needs more startups. What do you think about that?
MS: My view is that large enterprises by and large don't innovate. The innovation comes from startups. Large companies acquire the startups, and EMC acquiring Greenplum is an example of this. What I see is that technology transfers to startups, who then do the heavy lifting, and then the successful ones get acquired. So yeah, let there be more startups.
DV: Excellent. Thank you very much. Appreciate you coming to the Cube.