Online dating service eHarmony uses Hadoop processing and the Hive data warehouse for analytics to match singles based on each individual's "29 Dimensions® of Compatibility". According to eHarmony, an average of 542 of its members marry daily in the United States.
Hadoop, a technology inspired by Google research that originated in the Apache Nutch project and matured at Yahoo! as a way to index and search Web content, has expanded to big data use cases across multiple industries and the public sector. It is an increasingly popular option for processing, storing, and analyzing huge volumes of semi-structured, unstructured, or raw data, often from disparate data sources. Hadoop's lack of a fixed schema works particularly well for answering ad-hoc queries and exploratory "what if" scenarios.
Enterprises use Hadoop in data-science applications that improve operational efficiency, grow revenues, or reduce risk. Many of these data-intensive applications use Hadoop for log analysis, data mining, machine learning, or image processing. They benefit from Hadoop's colocation of storage and processing on each data node, spread across a cluster of cost-effective commodity hardware.
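To make that concrete, here is a minimal sketch of a log-analysis job: a MapReduce program in Java that counts HTTP status codes in Web server access logs. The space-delimited log layout, with the status code in the ninth field, is an assumption for illustration and is not tied to any particular deployment mentioned in this article.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StatusCodeCount {

  // Mapper: emits (statusCode, 1) for each log line.
  public static class LogMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text status = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumed layout: space-delimited access logs with the HTTP
      // status code in the ninth field (common/combined log format).
      String[] fields = value.toString().split(" ");
      if (fields.length > 8) {
        status.set(fields[8]);
        context.write(status, ONE);
      }
    }
  }

  // Reducer (also used as combiner): sums counts per status code.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "status code count");
    job.setJarByClass(StatusCodeCount.class);
    job.setMapperClass(LogMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Run against a directory of access logs in HDFS, the job produces one line per status code with its total count, with map and reduce tasks executing on the data nodes where the log blocks reside.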
The Hadoop Distributed File System (HDFS) and MapReduce address two of the fundamental attributes of big data: the growth of enterprise data volumes from terabytes to petabytes and beyond, and the increasing variety of complex, multi-dimensional data from disparate sources, ranging from logs and call detail records to GPS location data and environmental sensor readings.
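On the storage side, a short sketch using the HDFS Java FileSystem API shows how raw files land in the cluster; the namenode address, directory, and file name below are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS normally comes from core-site.xml on the cluster;
    // it is set explicitly here for illustration (hypothetical host).
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    Path dir = new Path("/data/raw/sensors");
    fs.mkdirs(dir);

    // Copy a local file of sensor readings into HDFS, where it is
    // split into blocks and replicated across data nodes.
    fs.copyFromLocalFile(new Path("readings.csv"),
                         new Path(dir, "readings.csv"));

    System.out.println("Replication factor: " +
        fs.getFileStatus(new Path(dir, "readings.csv")).getReplication());
    fs.close();
  }
}
```

Once written, the file's blocks are replicated across data nodes, which is what lets a MapReduce job like the one above schedule its tasks close to the data.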
For the third attribute of big data, velocity, applications that require real-time or "right-time" data processing can turn to Apache HBase, a distributed, column-oriented database modeled after Google Bigtable that runs on top of HDFS. For example, in addition to using Hadoop for Web analytics and to back up MySQL databases, Facebook uses HBase as a back end for materialized views to support real-time analytics.
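As a sketch of the low-latency access pattern HBase supports, the following uses the standard HBase Java client for point writes, point reads, and atomic counters; the "pageviews" table, its "metrics" column family, and the row-key scheme are hypothetical, not Facebook's actual schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PageViewCounter {
  public static void main(String[] args) throws Exception {
    // Reads the cluster location from hbase-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         // Hypothetical table: one row per (page, hour) bucket.
         Table table = conn.getTable(TableName.valueOf("pageviews"))) {

      byte[] family = Bytes.toBytes("metrics");
      byte[] rowKey = Bytes.toBytes("home#2011-07-15-09");

      // Low-latency write of a single cell.
      Put put = new Put(rowKey);
      put.addColumn(family, Bytes.toBytes("views"), Bytes.toBytes(1024L));
      table.put(put);

      // Low-latency point read of the same row.
      Result result = table.get(new Get(rowKey));
      long views = Bytes.toLong(result.getValue(family, Bytes.toBytes("views")));
      System.out.println("views = " + views);

      // Atomic counter increment: real-time aggregation as events arrive.
      table.incrementColumnValue(rowKey, family, Bytes.toBytes("views"), 1L);
    }
  }
}
```

Row keys that bucket a page by hour keep related cells together, so point reads and counter increments stay fast even as the table grows.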
For most organizations, Hadoop is one extension or component of a broader data architecture. For example, AOL Advertising pairs Hadoop's capability for handling large, complex data volumes with Membase's sub-millisecond latency to optimize decisions for real-time ad placement.
With a platform approach, you can incorporate Hadoop into a shared-services model, where, for example, a Hadoop cluster is one option for spinning up a temporary data sandbox for data scientists, analysts, or line-of-business users to study a particular problem or explore "what if" scenarios. When their study is finished, those storage and compute resources can be re-provisioned for other tasks, avoiding ongoing shadow data marts and protecting data consistency.
Another option with a platform approach is to use Hadoop clusters for data aggregation, pre-processing, and cleansing before loading verified data into an enterprise data warehouse (EDW) or production system, where the EDW continues to be the repository of record for a “single view of the truth”.
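As one sketch of this pre-processing pattern, a map-only MapReduce job can validate and normalize raw records before the verified output is loaded into the EDW; the five-field, comma-delimited record layout below is an assumed format for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CleanseRecords {

  // Map-only job: keep well-formed rows, normalize them, and track
  // rejects with a counter; no reduce phase is needed.
  public static class CleanseMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final int EXPECTED_FIELDS = 5; // assumed layout

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",", -1);
      if (fields.length != EXPECTED_FIELDS || fields[0].isEmpty()) {
        context.getCounter("cleanse", "rejected").increment(1);
        return; // drop malformed record
      }
      // Example normalization: trim whitespace, lower-case the key field.
      StringBuilder clean = new StringBuilder(fields[0].trim().toLowerCase());
      for (int i = 1; i < fields.length; i++) {
        clean.append(',').append(fields[i].trim());
      }
      context.write(new Text(clean.toString()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cleanse records");
    job.setJarByClass(CleanseRecords.class);
    job.setMapperClass(CleanseMapper.class);
    job.setNumReduceTasks(0); // map-only: output goes straight to HDFS
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With zero reduce tasks, the cleansed output is written straight back to HDFS, and the "rejected" counter gives a quick data-quality check before the downstream load.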
After participating in Hadoop user communities, both local and virtual, for the last several years, I'm happy to share innovative use cases and "areas to watch out for" from my work with Hadoop pioneers and practitioners in deploying and integrating Hadoop as part of a broader enterprise data architecture. I also bring a user perspective as a certified Hadoop system administrator.
Action Item: For more information, visit my Hadoop blog series at http://blogs.informatica.com/perspectives/index.php/author/brett-sheppard/