I’m often asked by customers when to use Hadoop and when not to. First, ask yourself whether you are getting the most out of all the data available to you – both internal and external to your enterprise. You may have pondered this very question on your way into work while sitting in traffic.
Or, if you find yourself sitting in traffic for hours, you may instead wonder if there is a better way to commute and avoid all the traffic jams. Maybe taking public transportation or carpooling would be faster and cost less. But these alternatives have some drawbacks too. You would have to coordinate your schedule with the train, subway, bus, or a colleague willing to commute together. In other words, a little more preparation and wait time is involved as opposed to simply jumping in your car and hitting the freeway. The tradeoffs you make relate to performance (measured by how fast you can get to work) and cost (measured in gas, fares, effort, and inconvenience).
Similarly, whether or not Hadoop can help you get more out of your data -- to gain competitive advantage, discover customer insight, and innovate faster -- relates to tradeoffs in terms of performance (time to process the data) and cost (investments in hardware, software, and skilled resources). Your organization has probably spent years building out its information architecture to support growing transaction data volumes consisting of years of historical data combined with up-to-the-second operational data, known as big transaction data.
What’s different today is that big interaction data is exploding as social and mobile applications become more prevalent, machine device data (e.g., utility smart meters) is collected on the order of minutes, and web logs, clickstreams, and call detail records (e.g., customer support calls) generate terabytes of data on a daily basis. Imagine what you could learn about customer sentiment and behavior if you could efficiently and cost-effectively store and process all of this data with Hadoop. This information could help you calculate more reliable customer churn indicators and better predict the best incentive offers to retain and acquire customers. Hadoop is a technology that can help you cost-effectively store and process both big transaction data and big interaction data in a reasonable amount of time.
So back to the original question -- to use Hadoop or not to use Hadoop. In theory, you might be able to process petabytes of unstructured data using traditional relational technology if you’re willing to spend an exorbitant amount of money. Depending on how much money you’re willing to invest, you would either be “sitting in traffic” waiting an unacceptable amount of time to process and analyze the data, or you would need to invest in several more “freeway lanes” (i.e. hardware and software licenses) to provision the additional storage and processing capacity to reduce latencies. Relational database technology is great for storing and processing large amounts of mostly structured data that must maintain transactional consistency and other ACID properties. With Hadoop you can cost-effectively store and process very large amounts of unstructured data (up to and beyond petabyte scale) fast using lower cost commodity hardware.
If you do not process or intend to process very large amounts of unstructured data, and you’re meeting your SLAs, then use what you have today. That is, just jump in your car and drive to work.
However, I caution you that while the traffic report may look good today, with data exploding around us, traffic congestion is inevitable unless you consider your options. Therefore, I encourage you to look into extending your existing infrastructure with Hadoop. To get the most business value from your data, including big transaction data and big interaction data, you need a data integration platform that can help you integrate all of your data on whichever platform makes sense – whether it is on Hadoop or traditional computing environments.
Hadoop is not a replacement for your existing data processing infrastructure. Given that most of your BI and enterprise applications are built on relational database technologies today, after the data is processed in Hadoop you need to deliver it to the business for decision support and improved operations by combining records from Hadoop and other environments.
Action Item: The Informatica data integration platform can help you transition into the world of big data, whether you decide to Hadoop or not to Hadoop. Informatica supports a hybrid IT infrastructure enabling you to access, integrate, and deliver your data anywhere whether on-premise or in the cloud, including Hadoop. Learn more at http://www.informatica.com/products_services/Pages/big_data_integration.aspx