There’s been plenty of talk about big data lately, and it’s finally spilling into the world of infrastructure. I’m not surprised. But there’s a lot of confusion and many frustrating misconceptions. I understand. It’s a confusing topic – especially to us infrastructure people. We like to simplify things. Big data = big iron, big RDBMS, beyond terabytes. Big. Data. I get it!
My new friend Andreas Weigend said to me the other day, “Dave, infrastructure is irrelevant.” Ouch. The truth is he’s largely correct. Infrastructure is plumbing. Plumbing is only relevant when it’s not working – then it matters a lot – otherwise there are way more important things to talk about that deliver true value.
But infrastructure enables applications, you say. Infrastructure can be profitable. Well, I’m all for enabling value and profits; that’s cool. And we need infrastructure to deliver data value, and people can profit from that – but it’s not going to be your daddy’s infrastructure that powers big data.
Ten Big Data Realities
Here are the first ten points that I want you to think about when you’re grokking big data:
- Oracle is not big data
- Big data is not traditional RDBMS
- Big data is not Exadata
- Big data is not Symmetrix
- Big data is not highly structured
- Big data is not centralized
- IT people are not driving big data initiatives
- Big data is not a pipe dream – big data initiatives are adding consumer and business value today. Right now. Every second of every minute of every hour of every day.
- Big data has meaning to the enterprise
- Data is the next source of competitive advantage in the technology business.
The Next Big Thing
Visionary Tim O’Reilly underscored this last point when he spoke with me and my colleague John Furrier at Strata’s Making Data Work event. O’Reilly is the man who led the industry in spotting huge trends including open source software and Web 2.0. He told us that he came to the conclusion that data was the next point of leverage by observing the PC analogy. He cited IBM’s mistaken assumption during the PC era that hardware was the primary source of lock-in, only to blindly support the commoditization of hardware and hand over its monopoly to Microsoft. Tim started to ask “what happens when software becomes a commodity too?” This is when he realized that the next source of competitive advantage would be “large databases generated through collective action over the Internet.” Check out Tim’s comments in this short video.
This brings me back to so-called big data. Back in the day, if you had lots of data to analyze, you’d buy the biggest Unix box you could lay your hands on and, if you had any money left over, you’d pay Oracle through the nose for some database licenses. That big Unix box became a “data temple” and the DBA held the keys to the kingdom. You’d bring all of your data into that box, where function resided in the form of code, all revolving around a relational database.
When Google started its search operation, it realized that it couldn’t suck this huge volume of dispersed information into a data temple – it just wouldn’t work – so it developed MapReduce, and the early days of big data were born. That led to Doug Cutting and his friends inventing Hadoop (with some help from Yahoo), and since then a whole ecosystem around big data – Apache, Cassandra, Cloudera and a zillion other important pieces – has exploded.
No Oracle. No Symmetrix. No RDBMS. Very unstructured. Highly dispersed. Lots of data hacking from multiple sources on the Internet. Inside and outside of firewalls. The basic premise was: don’t bring petabytes of data to a temple; rather, bring megabytes of code to the data and avoid the network bottleneck – whoa – V8 moment!
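If you want to see the "bring code to the data" idea in miniature, here is a rough Python sketch of the map and reduce pattern. It is a toy under stated assumptions, not how Google or Hadoop actually implement it: the `PARTITIONS` dictionary and its node names are hypothetical stand-ins for data blocks sitting on separate machines, and everything runs in one process. The point is only that the small function travels to each partition and just the compact summaries come back to be merged.

```python
from collections import Counter
from functools import reduce

# Hypothetical stand-in for data blocks that live on three different nodes.
# In a real cluster these would be files on separate machines.
PARTITIONS = {
    "node-a": ["big data is not centralized", "data is dispersed"],
    "node-b": ["bring code to the data"],
    "node-c": ["data data everywhere"],
}


def map_partition(lines):
    """Map step: runs next to the data and returns a small word-count summary."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial summaries into one."""
    return left + right


def word_count(partitions):
    # Only the partial Counters cross the "network", never the raw lines.
    partials = [map_partition(lines) for lines in partitions.values()]
    return reduce(reduce_counts, partials, Counter())


totals = word_count(PARTITIONS)
```

The raw lines never leave their partition in this sketch; only the per-partition counts get combined, which is the whole trick when the data is petabytes and the code is megabytes.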
So lots of Internet companies have hopped on the big data bandwagon – Google, Yahoo, Facebook, Twitter, etc. But it’s not just the Web whales. Abhi Mehta, who at the time was with BofA, told me financial services is all over big data and it’s changing the business model. For example, sampling, he said, is dead. Rather than build fraud detection models on samples, realize an outlier breaks the model and then have to tear it down and rebuild it, financial services firms can now analyze five years of fraud, every single instance, and operate on the entire data set to spot patterns and trends – in orders of magnitude less time. That’s game-changing.
And it’s not just financial services and Internet businesses. Manufacturing, energy, government, retail, health care…everyone has a big data problem – or should I say opportunity? White space is everywhere. Securing data, automating the data pipeline, new ways to visualize data, new products built on data (see LinkedIn Skills), new services – the big data list goes on and on. And it’s here today – just Google “pizza” and you’ll tap an enormous database from your mobile access point and get recommendations, menus, directions, deliveries – whatever you need to make a decision.
What Does Big Data Mean to Infrastructure Professionals?
Here are the next ten things you should know about big data:
- Big data means the amount of data you’re working with today will look trivial within five years.
- Huge amounts of data will be kept longer and have way more value than today’s archived data.
- Business people will covet a new breed of alpha geeks. You will need new skills around data science, new types of programming, more math and statistics skills and data hackers…lots of data hackers.
- You are going to have to develop new techniques to access, secure, move, analyze, process, visualize and enhance data, in near real time.
- You will be minimizing data movement wherever possible by moving function to the data instead of data to function. You will be leveraging or inventing specialized capabilities to do certain types of processing – e.g., early recognition of images or content types – so you can do some processing close to the head.
- The cloud will become the compute and storage platform for big data which will be populated by mobile devices and social networks.
- Metadata management will become increasingly important.
- You will have opportunities to separate data from applications and create new data products.
- You will need orders of magnitude cheaper infrastructure that emphasizes bandwidth rather than IOPS, along with efficient data movement and metadata management.
- You will realize sooner or later that data and your ability to exploit it are going to change your business, social and personal life, permanently.
Are you ready?