Ed. Note: Always interesting and frequently controversial, Abhi Mehta provides his latest vision of the future based on Big Data, along with two major announcements from Tresata, the company he founded and heads, in this transcription of his latest interview in the Cube. The interview, with Wikibon's David Vellante and SiliconAngle Founder and CEO John Furrier at the Strata + Hadoop World 2012 conference in late October, is available as a recording on SiliconAngle's YouTube channel. This transcription is intended as a companion to that recording: readers can follow along while watching or use it to review the material afterward.
DV: We're here with Abhi Mehta, the founder & CEO of Tresata. Abhi, a good friend, welcome back. I remember at the second HadoopWorld we had you on & you made all these tremendous predictions about the data pipeline and how Hadoop was going to change the world, & how really important & profound it was to bring, for instance, 5 Mbytes of code & 10 PB of data, & that's exactly what's happened here. All these predictions you made came true. So we're glad to have you back & hope you will make some more.
JF: You did your magic in the Cube & I saw you did your magic with the groundbreaking video you did with Data Factories, which again is well regarded in the industry & is a seminal moment in Cube history. So tell us, what's new now?
AM: Absolutely. First, it's always a pleasure to be here & talk with the two of you. My brief from John is always let's be a little controversial. So we should make our next set of predictions. We're seeing the emergence of what I'm calling the Three A's of analytics:
- The first is “As-a-Service.” It was very refreshing to see – & you guys covered both the events – Oracle & IBM come out & say that “As-a-Service” is a business model that's not going to go away. We're seeing that on the client side as well. I think as-a-service is no longer a choice or an option, it's a necessity. If you are in an industry vertical, & you have a competitor – an emerging competitor or existing competitor – that has zero upfront capital investment, & is redoing its own infrastructure to do new products & services, you are not only at a disadvantage, but it's going to differentiate the winners from the losers. So as-a-service, regardless of which industry vertical you're in, is not a choice, it has to be done.
- Second: Delivering actionable insight. It's all about business value, answering the questions that couldn't be answered before. And I do think it's something that was said on your show that clients are less interested in the word “Hadoop”. So for two years we've talked about Hadoop; I think you will see in the Cube that people will talk less & less about Hadoop and more about the answers that are delivered using technologies like Hadoop. It doesn't matter if you are delivering actionable insight using DB2 or Greenplum or HBase as long as it gives you actionable insight that solves a business problem. On the back end if the engine is powered by Hadoop, doesn't matter. If that means that the technology vendor or the startup company has economics ?? doesn't matter. Because you're solving a big problem. (DV: Using Oracle – uh oh!) I think IBM's announcement at the Big Data conference was huge. And things they were talking about – Big-Data-as-a-Service makes an interesting acronym – B(a)DaaS – but big-data-as-a-service is becoming de rigueur.
- AI: The launch of the next intelligence engine, the new AI. It's assigning intent. It's a really big concept that people are trying to get their minds around, which is: How do you assign intent? Irrespective of where the data comes from. I think I'm tired of the words structured, unstructured data, multistructured data – Doesn't matter. If you have the ability through your analytics engine & platform to assign intent at the individual level, it will help you to influence consumer behavior. That is an excellent ???
So we're launching a private beta of a very interesting project to look at SKU-level information, the actual SKU-level data, and attaching to that social data amassing from Facebook. I do believe Facebook data changes fundamentally the way we drive, design, & offer products & services to consumers. So you will see us hopefully at the next Cube at Hadoop Summit announce the first commercial use of Facebook data against SKU data to show how you can drive sales through Facebook. I think that whole concept of assigning intent irrespective of source system format is the future, because I do believe social data is a really big revolution that will absolutely change how companies understand what consumers want & deliver products & services as consumer advocates, not trying to make them pay for things they shouldn't be paying for.
JF: What we are seeing is that analytics is the top killer app. Last year at HadoopWorld people were predicting $100 million in investment in applications. It kind of didn't happen this year. What did happen is analytics – that's the killer app. What do you make of that? Do you agree that applications haven't evolved fast enough? Is it a maturing issue? Why is analytics the killer application?
AM: Best I've heard in the space is research by the Research Board. They call it, “the liminal moment in analytics and data”. It's interesting because liminal means you're at this inflection point. No one really knows. We all know what the data & analytics looks like today. We all know what the data & analytics structure looks like today. It's easy to find. You have hardware, software, these tiers with BI. No one really knows what the emerging big data analytics stack will look like. Will there be middleware? Will there not be middleware? Will there be BI? Will there not be BI? So I think there is a little bit of a jostling for space figuring out is apps the next big thing for Big Data, or is it analytics. It doesn't really matter if it is as-a-service, but it's a question of applications or not. So I think there is a little jostling to figure out, & hopefully Wikibon will write the next big report on what is the emergent stack for Big Data.
JF: We're working on it. My opinion is it's early. I think there's a lot of innovation that has to be done on performance, reliability under the hood that the geeks are working on. But I think analytics is the easy, low-hanging fruit. People are getting insights that they never had before. This was highlighted by Tim Estes's keynote yesterday from Digital Reasoning where he showed a graph of the data growth that's massively going to the moon, & then he showed a graph of human attention, & that is flat. So humans can only grok so much information. He calls that the understanding gap. That's where analytics really plays a big part in this. So to me I think it's too early to predict what apps will be out there because it's the cart before the horse, the chicken & the egg. But analytics gives you insight right now. So I think that's a good sign. I don't see any negative signs, but I think that's a factor.
Now the stack is an interesting issue. There's a platform war going on in the Hadoop world. Cloudera wants to be the Big Data platform. They're now competing with MapR & others.
AM: That's right. And IBM.
JF: And IBM. Hortonworks is taking a different approach. They're more the OEM, saying, “Hey, we'll do business with 'providers'.” So it's interesting. I see those approaches as both viable. One, I've got to compete and win – winner takes all. And the other one is you get nested into these other platforms – Teradata, Microsoft....
DV: [Cloudera Chief Scientist Jeff] Hammerbacher talked about this a little yesterday. Where are we going next? How can we layer on other models, things like support vector machines or whatever it is that pops up. So there's a discussion going on around does more data beat better models. And often times misconstruing [Google Director of Research] Peter Norvig's statement about the data?? And basically when we first met you in the Cube, you said sampling's dead. That's powerful. I think people glom onto that and say “Sampling's dead. Data beats models. So we don't need models any more.” That's not true, though. People can do maybe more with better algorithms. So I think you're right. There may be other layers in the stack that maybe relate to better modeling. Maybe it's simplified modeling.
AM: I agree. I think the fundamental principle that bottom-up analysis always wins, always holds true. If you can write models that the outliers inform rather than stress, something is dead. ATM is dead because you can't afford to move data back & forth. I don't know what that means to existing players. The big question we ask ourselves is this: If I were to build the next generation data analytics company, what am I building it on? If it's HDFS [Hadoop Distributed File System], which I think is the right answer. That's what I'd build on, right? I'd bet on that. Then do you need layers on top of HDFS to make a solution like mine more useful? The answer, Dave, is, I don't. So what does it mean to try to build those layers on top of HDFS? I don't know. I think that liminal moment changes rapidly because … I've always said this: Big data is not a technology opportunity. We said it together. Big Data was not a technology opportunity. You said that for two years now. And then Gartner comes out finally after two years & says well it's a $200B opportunity. Well, it's bigger than that.
We're talking to the head of credit cards at one of the biggest global banks. He looks at our solution for underwriting and he goes, “If we do this well the payoff is not in the millions. It's not in the billions. It's in the trillions. Because we can redo the banking industry off of it.” No one is writing that story.
[To JF] So I think I agree with you. It's a little early, it's a little unclear as to what the stack will look like. Someone needs to make a ?? on what the stack should be. I think analytics is easy to gravitate to because more data does trump better models. Here's the challenge. If I'm to assign intent, that model has not been written.... So I think you'll see the word “engines” appear more than “platforms”, than “database”, than “data warehouses”. Because at the end of the day it is a combination of smart people, smart algorithms, & this last component of truly democratized data on a platform that really is free.
And people are used to that idea. I was talking to an investment banker. And he goes, “You know the big realization companies are having is software is free.” So if software is free, where is the money? This is what IBM is doing. IBM's Smarter Planet positioning is by far the best marketing positioning ever done in the history of technology, because people aren't asking the question how the electricity in a city is optimized. They say, “I'll do it. It may be magic. Don't worry about how it is done. I'll do it.”
We call it the last mile of analytics, because it's not about data, it's about analytics. The last mile of analytics, 9/10ths of the mile can be automated, John. Absolutely can be automated. But the last 1/10th cannot. So the 9/10ths is machine learning. What does machine learning do? Machine learning mimics smart people and makes the machine do the work of people in a smarter way. In a repeatable way. In a scalable way. If I have to build a new analytics engine that works for the US population, whether it's Facebook for social, a bank for a retail project, or for retail, I have to make machines do it because there's no way I can build an analytics engine, a business, manually around people manipulating data for 250 M units. Not possible.
So machine learning can automate 9/10ths of that last mile. And the last 1/10th comes from people. That's the other issue. Data scientists, while that is a very interesting term, data science doesn't exist yet. So our chief scientist, he's a new person and at some point I would like to introduce him to this Cube and to you guys, is going to be launching a center for data science, we hope to invite you, in summer. It will be the first master's & Ph.D. program on data science. Because we are seeing customers ask us, “I love it, I love the solution. But I don't have the people. I don't have the people to write the new models for all this data. So sampling is dead, and we need to analyze all of the population. Who'll write it for me?” That needs to come. It's not there yet.
JF: So talk about your view of the current situation for data, because data now is being talked about as the raw material for analytics. In your keynote yesterday you talked about that. You said, “If software's eating the world, then data is the meal.” So software is part of everything we talk about on the Cube, but data is critical, you've got bad data either you clean it or get good data. So data's critical. So the role of the data scientist. Talk about the range of the role of the person or team, & what's going to happen in this data market.
AM: We see the role of data science in two parts or two roles. One is incredibly sexy and one that is incredibly boring but is 80% of the work – we call them data quants and marketing quants. There are enough marketing quants in the world who can use existing tools and write interesting algorithms to solve problems that we all are aware of. You can find tons of them. They all know certain tools that we don't need to talk about, and they've all been to universities and colleges.
The data quant role is incredibly ill-defined. But it has the most value, which is people who have the ability and resources. It's sort of like a new data cyborg. They are part physicist, part mechanical engineers, part statisticians and a total ?? of common sense. They can look at data & go, “If I was to mash social data with SKU-level data and an intent engine & bring it all together, I can change the way marketing is done in every vertical. Or I can change the way you design a product, and that could be a banking product or a healthcare or medicine for medical research.” That skill set is incredibly ill-defined & not available in the market. I think that's what's lacking where 80% of the work comes into play because a data quant with the right tool set to function with data that's growing at I don't know what, a new word, celabytes – there's no human way possible. So to train the data quants is a very interesting problem that no one today is solving. And it is a cyborg from a skills perspective.
JF: We were talking last night, kind of ripping on this. And I was saying I equate it to training a dog. There are different kinds of dog – there are retrievers and …. But you've got to train the machine. But the machines are trainable. What you're talking about is training the machine to do the learning & reason to create that gap because humans can't possibly get through the data. So the insight has to be done by machines. So all the startups that I've seen that are good are doing that. But that's hard to do. How do you train a machine?
AM: I think it's also a bigger technology driver that I see you guys have done a phenomenal job of tracking. I call it the iPhone-ization of IT. So Enterprise IT will go the way of iOS and iPhone. You're going to buy incredibly cheap hardware for almost free, like you buy an iPhone. And every two years you get an upgrade for almost no cost. Hardware is free, storage is free. And the operating system that you run on that hardware is going to be free. Because it should be. Should you pay for HDFS? No, you shouldn't. It's the iOS of data. It should be free. Then you're going to buy applications. Because then your data is sitting in HDFS. Why should you pay for HDFS? The distribution model I don't buy into. I love MapR, I love Hortonworks. It's free. Come on guys, let's get over it. So the distribution is free, and what do you pay for? You pay for the apps. And some apps will be free – could be bookkeeping will be one of them – and some apps you pay money for. I think enterprise IT goes that way.
JF: Abhi, I have to interrupt you because we've got planes stacking up as Mark Hopkins likes to say. So I want to ask you one question to end on: What is the vision for you & Tresata going forward? Give us a quick update and what's your next set of plans.
AM: We have a new mission statement. Tresata has only one mission, which is we help all our customers use all of their data to get, grow & keep their customers. Big companies are not too big to fail, they're too big to manage. Tresata will help you manage those companies and manage their customers and build products & services that advocate to consumer needs.
JF: So Abhi Mehta, always interesting, always controversial, it's good to have you on. So the slugfest continues at Hadoop World.