The public cloud is gaining traction as a popular platform on which enterprise developers and data scientists can explore novel Big Data use cases and perform small proofs-of-concept. These often involve relatively small subsets of data extracted from enterprise-owned data sets, and projects are often brought back in-house when they move to production-level status.
This occurs for a number of reasons, from data security and privacy concerns to data movement and network constraints. By bringing Big Data projects back into corporate data centers, however, enterprise developers and data scientists are forgoing one of the major benefits of the cloud with respect to data-centric applications and analytics: simplified and cost-effective access to third-party data.
Significantly more value can be realized from Big Data projects when internal data sets – such as customer transactional data – are merged with third-party data that, when analyzed, reveals insights that cannot be uncovered with internal data alone. By bringing these data sources into the equation, business analysts and data scientists are more likely to discover game-changing insights that positively impact the bottom line.
By definition, third-party data is created and stored outside any given enterprise’s data center. Social media data – such as Facebook “likes”, Tweets and Pinterest comments – gets the lion’s share of attention, but third-party data sets such as financial markets data, weather data, and aggregated industry-specific data are just as valuable, if not more so. Because these data sets “live” in the cloud to begin with, it sometimes makes more economic sense to access and integrate them with internal data via cloud-based Big Data deployments.
Put another way: It costs significantly less money and takes a lot less time to move a few terabytes of internal structured data from a corporate data center to the cloud than it does to move several petabytes of multi-structured data from the cloud to a corporate data center. The resulting high-value analytics and insights can then be brought back into corporate data centers.
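To make the asymmetry concrete, the following is a rough back-of-the-envelope sketch of transfer times in each direction. The data volumes (5 TB and 3 PB), the 1 Gbps link, and the 80 percent link efficiency are illustrative assumptions chosen for this example, not figures from any specific deployment.

```python
def transfer_days(size_terabytes: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate the days needed to move a data set over a network link.

    size_terabytes: data volume in decimal terabytes (1 TB = 10**12 bytes)
    link_gbps:      nominal link speed in gigabits per second
    efficiency:     assumed fraction of nominal bandwidth actually achieved
    """
    bits = size_terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400  # seconds per day


# Hypothetical volumes standing in for "a few terabytes" and "several petabytes".
internal_tb = 5          # internal structured data moved to the cloud
third_party_tb = 3_000   # 3 PB of multi-structured data moved in-house
link_gbps = 1.0          # assumed dedicated 1 Gbps connection

internal_days = transfer_days(internal_tb, link_gbps)
third_party_days = transfer_days(third_party_tb, link_gbps)

print(f"Internal data to cloud:    {internal_days:.1f} days")
print(f"Third-party data in-house: {third_party_days:.1f} days")
print(f"Ratio: {third_party_days / internal_days:.0f}x")
```

Under these assumptions the internal data moves in well under a day, while the third-party data would take most of a year. The ratio depends only on the relative data volumes, so it holds regardless of the link speed assumed.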
Further, cloud service providers and data brokers have begun offering packaged data services that make it relatively simple to access and integrate new data sets with cloud-based Big Data projects (see Figure 1).
CIOs must nonetheless seriously consider the security and privacy implications of performing Big Data analytics in the public cloud. In particular, when third-party data is brought to bear in Big Data projects, the resulting analysis and insights often take the form of new data sets that are themselves significantly more sensitive than the source data. (Consider the recent controversy over research from the University of Cambridge that illustrated how traits such as sexual orientation can be determined by analyzing seemingly unrelated Facebook likes.)
Further, some enterprises for which Big Data analytics serves as a primary source of competitive differentiation (and this will be the case at more and more enterprises across vertical industries as Big Data projects mature) will likely determine that the benefits of investing in and developing their own internal Big Data core competency (both technology and people) outweigh the benefits of outsourcing to the cloud.
Action Item: CIOs should think seriously about deploying production-level Big Data projects in the public cloud when those projects incorporate significant amounts of third-party data. Perform a thorough cost-benefit analysis that takes into account the time and financial costs of identifying, integrating, and analyzing third-party data in both internal and cloud-based Big Data deployments. Do not, however, overlook the security and privacy implications of Big Data in the cloud. Push cloud service providers to provide detailed accounts of their security policies and capabilities in order to de-risk cloud deployments.