Not Your Father’s Data Analytics


Traditionally, data processing for analytic purposes has followed a fairly static blueprint. Enterprises create mainly structured data with stable data models via enterprise applications like CRM, ERP and financial systems. Data integration tools extract, transform and load the data from enterprise applications and transactional databases to a staging area where data quality checks and data normalization (hopefully) occur and the data is modeled into neat rows and columns. The modeled, cleansed data is then loaded into an enterprise data warehouse. This routine runs on a schedule, usually daily or weekly, sometimes more frequently.
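
The extract, transform and load routine described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source records, the `sales` table and the function names are invented for the example, and an in-memory SQLite database stands in for the enterprise data warehouse.

```python
import sqlite3

# Hypothetical records as they might come out of a CRM export (invented for illustration).
SOURCE_ROWS = [
    {"customer": " Acme Corp ", "amount": "1200.50", "date": "2011-10-01"},
    {"customer": "acme corp", "amount": "300", "date": "2011-10-02"},
    {"customer": "Globex", "amount": "not available", "date": "2011-10-02"},
]


def extract():
    """Extract: read raw records from the source system."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: normalize names, cast amounts, drop rows that fail quality checks."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # data-quality step: reject rows with an unparseable amount
        clean.append((row["customer"].strip().title(), amount, row["date"]))
    return clean


def load(conn, rows):
    """Load: insert the modeled rows into a warehouse table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """One scheduled run: extract, then transform, then load."""
    load(conn, transform(extract()))


# SQLite in memory stands in for the enterprise data warehouse.
warehouse = sqlite3.connect(":memory:")
run_etl(warehouse)

# A simple report query of the kind that would run against the warehouse.
report = warehouse.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer"
).fetchall()
print(report)
```

In a real deployment `run_etl` would be triggered by a scheduler on the daily or weekly cadence mentioned above, and the transform step would carry far more cleansing rules than the single check shown here.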

(Read the entire Big Data Manifesto here, which includes market analysis, technical primers on Hadoop and MPP Data Warehousing, and action items for enterprises and vendors.)

From there, data warehouse administrators create and schedule regular reports, which run against normalized data stored in the warehouse and are distributed to the business. They also create dashboards and other visualization tools for executives and management. Business analysts, meanwhile, use data analytics tools and engines to run advanced analytics against the warehouse, or often against sample data migrated to a local data mart due to size limitations. Non-expert business users perform basic data visualization and limited analytics against the data warehouse via front-end business intelligence tools from vendors like SAP BusinessObjects and IBM Cognos. Data volumes in traditional data warehouses rarely exceed multiple terabytes, and even that much is uncommon, because large volumes of data strain warehouse resources and degrade performance.

The Changing Nature of Big Data

The advent of the Web, mobile devices and other technologies has caused a fundamental change in the nature of data. Big Data has important, distinct qualities that differentiate it from “traditional” corporate data. No longer centralized, highly structured and easily manageable, data is now more than ever highly distributed, loosely structured (if structured at all), and increasingly large in volume.

Specifically:

  • Volume – The amount of data created both inside corporations and outside the firewall via the web, mobile devices, IT infrastructure, and other sources is increasing exponentially each year.
  • Type – The variety of data types is increasing, particularly unstructured text-based data and semi-structured data such as social media data, location-based data, and log-file data.
  • Speed – The speed at which new data is being created, and the need for real-time analytics to derive business value from it, is increasing thanks to the digitization of transactions, mobile computing and the sheer number of internet and mobile device users.

Broadly speaking, Big Data is generated by a number of sources, including:

  • Social Networking and Media: Social media is one reason data volumes are increasing. There are currently over 600 million Facebook users, 200 million Twitter users and 156 million public blogs. Each Facebook update, Tweet, blog post and comment creates multiple new data points.
  • Mobile Devices: There are over 5 billion mobile phones in use. Each call, text and instant message is logged as data. Mobile devices, particularly tablets, also make it easier to use social media and use other data-generating applications. Many mobile devices also collect and transmit location data.
  • Internet Transactions: Billions of online purchases, stock trades and other transactions happen every day. Each creates a number of data points collected by retailers, banks, credit cards, credit agencies and others.
  • Networked Devices and Sensors: Electronic devices of all sorts, including servers and other IT hardware, smart energy meters and temperature sensors, create log data that records every action.

Traditional data warehouses and other data management tools are not up to the job of processing and analyzing Big Data in a time- or cost-efficient manner. Data must be organized into relational tables, with neat rows and columns, before a traditional enterprise data warehouse can ingest it. Given the time and manpower needed, applying such structure to vast amounts of unstructured data is impractical. Further, scaling a traditional enterprise data warehouse to accommodate potentially petabytes of data would require unrealistic financial investments in new and, depending on the vendor, often proprietary hardware. Data warehouse performance would also suffer because all data loading passes through a single choke point. New ways of processing and analyzing Big Data are therefore required.
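
To make the "rows and columns first" constraint concrete, here is a minimal Python sketch. The three-column schema, the sample records and the `to_row` helper are all assumptions made up for illustration; they do not describe any particular warehouse product. The point is only that records which do not already match the fixed schema cannot be ingested without extra modeling work.

```python
import json

# Fixed relational schema (assumed): every row must supply exactly these columns.
COLUMNS = ("user_id", "event", "timestamp")

# Hypothetical incoming records: one structured, one semi-structured, one free text.
RAW_RECORDS = [
    '{"user_id": 1, "event": "login", "timestamp": "2011-11-08T09:00:00"}',
    '{"user_id": 2, "event": "tweet", "text": "hello", "geo": [40.7, -74.0]}',
    "Nov  8 09:00:01 web01 sshd[411]: Accepted password for alice",
]


def to_row(raw):
    """Return a tuple matching COLUMNS, or None if the record does not fit the schema."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # free text: no structure to map onto columns
    if not isinstance(record, dict) or set(record) != set(COLUMNS):
        return None  # missing or unexpected fields: does not fit the table
    return tuple(record[column] for column in COLUMNS)


rows = []      # records ready to be loaded into the relational table
rejected = []  # records that would need manual modeling before ingestion
for raw in RAW_RECORDS:
    row = to_row(raw)
    if row is None:
        rejected.append(raw)
    else:
        rows.append(row)

print(len(rows), "loadable,", len(rejected), "rejected")
```

Only the first record fits the table as-is. The tweet with extra fields and the raw log line would each need someone to design a mapping before a traditional warehouse could store them, which is the effort the paragraph above calls impractical at Big Data scale.
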
