Massively parallel processing lets users analyze complete data sets

It’s statistics 101: the larger the sample size, the more accurate the results.

So if you want to analyze your customers’ behavior patterns – do they shop online or in stores, when do they make purchases, how often do they make returns – the more customer data you can run through your analytics engine, the better your results.

But what if you didn’t have to rely on sample data sets, but could analyze all your customer data? You can’t get any more accurate a picture of customer behavior through data analytics than that.

The problem is that data volumes are growing larger by the day and traditional databases and data warehouses simply don’t have the horsepower to keep up. They weren’t designed to ingest or process hundreds of terabytes of data day after day.

A new generation of data warehouses has emerged, however, designed with the explosion of data in mind. Massively parallel processing (MPP) allows this new breed of warehouse to break up large data analytics jobs into smaller, more manageable chunks, which are then distributed to multiple processors. Analytics are run simultaneously – or in parallel – on each processor, and the results are then returned and synthesized.
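The split, process and combine pattern described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not how nCluster, Greenplum or any other product works internally: the purchase records, the function names and the use of a local thread pool as a stand-in for separate processors are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Break the full data set into roughly equal chunks (the 'distribute' step)."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division, never zero
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def count_by_channel(chunk):
    """The analysis each worker runs on its own chunk: purchases per channel."""
    return Counter(row["channel"] for row in chunk)


def parallel_channel_counts(rows, n_workers=4):
    """Run the same analysis on every chunk at once, then merge the partial results."""
    chunks = split_into_chunks(rows, n_workers)
    # Threads stand in for the separate processors of a real MPP system.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_by_channel, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # the 'synthesize' step
    return total


# Hypothetical data: 1,000 purchases, every third one made in a store.
purchases = [
    {"customer": i, "channel": "store" if i % 3 == 0 else "online"}
    for i in range(1000)
]
result = parallel_channel_counts(purchases)
```

The merged result is the same as counting over the whole list in one pass. The point of the pattern is that each chunk can be handled by a different processor, so the job can grow by adding workers rather than by waiting longer.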

A well-known example of MPP data warehousing in action is at MySpace. The social networking site boasts over 120 million active users monthly, which translates into over 10 billion ‘events’ per day, according to the company. MySpace deployed Aster Data’s MPP data warehouse appliance called nCluster so it could analyze all of its web traffic data, not just a sampling, in order to improve its marketing campaigns and to spot trends in user behavior.

One of the key benefits of analyzing complete data sets, rather than sample data sets, is that it eliminates the risk of missing less frequent but still critically important events or series of events. For a social networking site like MySpace, that could mean a spike in usage during certain times of the year that represents a potentially profitable marketing opportunity. Sample data sets could miss that spike.
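A toy example makes the risk concrete. The event log below is entirely made up, and the burst is deliberately placed between sampled positions, so treat it as a sketch of how a small sample can overlook a short-lived event rather than as a measurement of any real system.

```python
def count_event(log, name):
    """Count how many entries in the log match the given event name."""
    return sum(1 for event in log if event == name)


# Hypothetical log: 100,000 ordinary events with one short burst of 50 unusual ones.
events = ["page_view"] * 100_000
for i in range(40_001, 40_051):
    events[i] = "holiday_spike"

# A 1-in-1,000 sample keeps positions 0, 1000, 2000, ... and skips the burst entirely.
sample = events[::1000]

full_count = count_event(events, "holiday_spike")    # 50: the full data set sees the burst
sample_count = count_event(sample, "holiday_spike")  # 0: the sample does not
```

A random sample of the same size would fare little better here: with 50 unusual events among 100,000, a 100-event sample would be expected to contain far fewer than one of them.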

Many of the new MPP data warehouses from vendors like Greenplum and Aster Data are sold and deployed as appliances, with the software and hardware preconfigured for quick deployment.

Though perhaps a little late to the game, the mega-vendors are taking notice of MPP, with a number of them acquiring the technology from smaller players. Aster Data, for one, is in the process of being acquired by data warehouse stalwart Teradata. EMC purchased Greenplum last summer. And HP, which recently scrapped its homegrown data warehouse, NeoView, because it lacked scale-out capabilities, announced in February it will purchase Vertica.

Periods of consolidation in any IT area usually make customers and potential customers nervous, and rightfully so. No IT manager wants to purchase a new piece of technology only to have the vendor acquired and the product phased out. But that’s not likely in the case of MPP data warehouses.

EMC, HP and others are picking up MPP data warehouse vendors for their technology, not for their customers (who are still few in number) or simply to eliminate rivals. Certainly there will be developments and improvements in the technology in the coming months and years, but it’s a pretty safe bet that MPP data warehouses will be around, in some form, for a while.

That said, not every company has as much data as MySpace. If a traditional data warehouse is meeting your needs, investing in an MPP data warehouse might not be the right move at this point. But keep your eyes open. Data volumes are growing and show no sign of stopping or even slowing down.


John Furrier and Dave Vellante comment on how Greenplum is doing after the EMC acquisition.

John Furrier shares his angle on the HP-Vertica deal, with Brian Jacquet from Roku weighing in.

