Traditionally, data processing for analytic purposes follows a fairly static blueprint. Enterprises create mainly structured data with stable data models via enterprise applications like CRM, ERP and financial systems. Data integration tools extract, transform and load the data from enterprise applications and transactional databases into a staging area, where data quality checks and data normalization (hopefully) occur and the data is modeled into neat rows and tables. The modeled, cleansed data is then loaded into an enterprise data warehouse. This routine runs on a scheduled basis – usually daily or weekly, sometimes more frequently.
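The extract–transform–load routine described above can be sketched in a few lines. This is a minimal illustration, not a real integration tool: it uses SQLite in place of both the transactional source and the warehouse, and the table and column names (`crm_contacts`, `dim_contact`) are hypothetical.

```python
import sqlite3

def extract(source):
    """Pull raw rows from the transactional source system."""
    return source.execute("SELECT id, name, email FROM crm_contacts").fetchall()

def transform(rows):
    """Staging-area cleanup: trim whitespace, normalize case,
    and drop records that fail a basic data-quality rule."""
    cleaned = []
    for id_, name, email in rows:
        email = (email or "").strip().lower()
        if not email:
            continue  # data-quality rule: skip rows with no usable email
        cleaned.append((id_, name.strip().title(), email))
    return cleaned

def load(warehouse, rows):
    """Insert the modeled, cleansed rows into the warehouse table."""
    warehouse.executemany("INSERT INTO dim_contact VALUES (?, ?, ?)", rows)
    warehouse.commit()

# Demo with in-memory databases standing in for real systems.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE crm_contacts (id INTEGER, name TEXT, email TEXT)")
source.executemany("INSERT INTO crm_contacts VALUES (?, ?, ?)",
                   [(1, "  ada lovelace ", "ADA@EXAMPLE.COM "),
                    (2, "Charles Babbage", None)])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE dim_contact (id INTEGER, name TEXT, email TEXT)")

load(warehouse, transform(extract(source)))
print(warehouse.execute("SELECT * FROM dim_contact").fetchall())
```

In a production pipeline the same three stages would be handled by a dedicated data integration tool on a daily or weekly schedule, but the shape of the work – extract, clean and normalize in staging, load into the warehouse – is the same.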
From there, data warehouse administrators create and schedule regular reports that run against the normalized data stored in the warehouse and are distributed to the business. They also create dashboards and other visualizations for executives and management. Business analysts, meanwhile, use data analytics tools to run advanced analytics against the warehouse, or often against a sample of the data migrated to a local data mart because of size limitations. Non-expert business users perform basic data visualization and limited analytics against the data warehouse via front-end business intelligence tools from vendors like SAP BusinessObjects and IBM Cognos. Data volumes in traditional data warehouses rarely exceed multiple terabytes (and even that much is rare), as large volumes of data strain warehouse resources and degrade performance.
The Changing Nature of Big Data
The advent of the Web, mobile devices and other technologies has caused a fundamental change to the nature of data. Big Data has important, distinct qualities that differentiate it from “traditional” corporate data. No longer centralized, highly structured and easily manageable, now more than ever data is highly distributed, loosely structured (if structured at all), and increasingly large in volume.
- Volume – The amount of data created both inside corporations and outside the firewall via the web, mobile devices, IT infrastructure, and other sources is increasing exponentially each year.
- Type – The variety of data types is increasing, namely unstructured text-based data and semi-structured data like social media data, location-based data, and log-file data.
- Speed – The speed at which new data is being created – and the need for real-time analytics to derive business value from it – is increasing thanks to digitization of transactions, mobile computing and the sheer number of internet and mobile device users.
Broadly speaking, Big Data is generated by a number of sources, including:
- Social Networking and Media: Social media is one reason data volumes are increasing. There are currently over 600 million Facebook users, 200 million Twitter users and 156 million public blogs. Each Facebook update, Tweet, blog post and comment creates multiple new data points.
- Mobile Devices: There are over 5 billion mobile phones in use. Each call, text and instant message is logged as data. Mobile devices, particularly tablets, also make it easier to use social media and other data-generating applications. Many mobile devices also collect and transmit location data.
- Internet Transactions: Billions of online purchases, stock trades and other transactions happen every day. Each creates a number of data points collected by retailers, banks, credit card companies, credit agencies and others.
- Networked Devices and Sensors: Electronic devices of all sorts – including servers and other IT hardware, smart energy meters and temperature sensors – create log data that records their every action.