“I think we are witnessing the second industrial revolution,” says Abhishek Mehta, managing director for big data and analytics at Bank of America. “And it is fueled by data. And it will be bigger than the first industrial revolution, because finally technology has democratized not just the access of data to a plethora of new companies but also the ability to store, mine, clean, analyze, and produce data products that can solve problems that you could not have solved before.”
That new industrial revolution will be based on what he calls data factories. These are modeled on pioneering companies such as Facebook, Google, Twitter, Yahoo!, and Zynga. But the strategy for building a data factory in an established company like Bank of America with a huge existing infrastructure is “one of the big white spaces in the industry,” he said to Wikibon Co-Founder David Vellante and SiliconAngle Founder John Furrier at HadoopWorld. “We have set the challenge for ourselves of building the first financial services data factory.”
Another big white space is automating the data pipeline. That is mandatory for handling very large amounts of data, and “it doesn't exist today. You have to build it.”
Some of the mechanics are clear, he says. All the factories run on “what I call the 'data factory stack', with commodity hardware, massively parallel architectures, moving the smallest amounts of bits across the network.”
That last aspect, he says, requires a reversal of the normal practice of moving data to the application. In the big data world, data sets run to terabytes, far larger than the algorithms that analyze them. Trying to move all that data to the application will cause massive problems on the data center network. Therefore, he says, one of the core principles of the data factory is “pushing the code to the data rather than the other way around.”
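To make the principle concrete, here is a minimal Python sketch that compares the two approaches on a toy data set. It is an illustration only, not Mehta's or Bank of America's implementation. The names `PARTITIONS`, `move_data_to_code`, `move_code_to_data`, and `local_sum` are hypothetical, each in-memory list stands in for data held on a separate node, and serialized size is used as a rough proxy for network traffic.

```python
import marshal
import pickle

# Hypothetical partitions: each list stands in for data stored on a separate node.
PARTITIONS = [
    list(range(0, 100_000)),
    list(range(100_000, 200_000)),
    list(range(200_000, 300_000)),
]


def local_sum(partition):
    """The 'code' that gets shipped: a small function run where the data lives."""
    return sum(partition)


def move_data_to_code(partitions):
    """Conventional approach: every record crosses the network to the application."""
    # Serialized size of the raw data approximates the bytes moved.
    bytes_moved = sum(len(pickle.dumps(p)) for p in partitions)
    total = sum(sum(p) for p in partitions)
    return total, bytes_moved


def move_code_to_data(partitions, func):
    """Data factory approach: ship the function out, bring back only small results."""
    # One copy of the function's compiled code is sent to each node.
    code_bytes = len(marshal.dumps(func.__code__)) * len(partitions)
    # Each node computes locally (simulated here by a plain loop).
    partials = [func(p) for p in partitions]
    # Only the per-node partial results travel back.
    result_bytes = sum(len(pickle.dumps(r)) for r in partials)
    return sum(partials), code_bytes + result_bytes


if __name__ == "__main__":
    total_a, bytes_a = move_data_to_code(PARTITIONS)
    total_b, bytes_b = move_code_to_data(PARTITIONS, local_sum)
    print("data to code:", total_a, "bytes moved:", bytes_a)
    print("code to data:", total_b, "bytes moved:", bytes_b)
```

Both functions return the same total, but the second moves only the function and three partial sums, so its traffic stays roughly constant as the partitions grow. Real systems add scheduling, fault tolerance, and data locality tracking that this sketch leaves out.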
Today, he says, the only technology that can be used in this way with massive amounts of data is from a new BI company called [www.tableausoftware.com Tableau Software] in Seattle, started by Stanford Professor Pat Hanrahan. “He completely turned the concept of BI on its head and said, 'BI needs to be used by the business user, not the technologist.' So Tableau does just that with their own custom language called VizQL, which pushes the code out.” Vertica and Aster Data have similar concepts on the database side, he added.
Another core principle that goes against established practice, he says, is that “algorithms are no longer proprietary.” The winning strategy is not to focus on writing the next graph algorithm, because long before it is done someone else somewhere else will have already written it. “How you apply it and which problem you apply the algorithm to is now the important thing. This is a massive game changer, because people for the longest time have spent their time thinking about writing an algorithm that nobody else has and then protecting it.”
In fact, he says, a lot of camaraderie has emerged among the developers in the leading data factory companies in Silicon Valley as they have realized this. Because they no longer have proprietary secrets to protect, they work together, with one company applying an algorithm to one question and another using the same algorithm to solve some other problem.
Action Item: The third vital missing piece is a clear set of laws governing digital rights. Who owns the data, and what can you do with it? What security and privacy levels are required? Some very smart people are working on this, he said. “I think the industry needs to take a leadership role, come together, and establish a set of laws, just like those for physical properties, to govern ownership and use of intangible property like data. That doesn't exist today, and the industry needs to make its own proclamation of digital rights rather than waiting for someone to do it to us.”