Problem: Zettabytes of Data
Primary Data came out of stealth in November 2014. David Flynn, a co-founder, CTO, and architect of the Primary Data solution, was previously the CTO (later CEO) and chief architect of Fusion-io, where he made major contributions to the use of flash as an extension of DRAM, before being ousted by investors wanting to cash out to SanDisk for about $1 billion.
The problem Primary Data is addressing is the exabytes of data locked up in storage arrays, cloud services, and tapes. Each storage array family and cloud service is different, with unique data services. The data itself includes all the information about the data, the metadata. Each storage array is an island, each file a rock on the island. Sure, you can connect arrays together in NetApp's ONTAP 8 storage virtualization network and move the rocks around the island; it is still an island, and the data still includes all the metadata. Sure, EMC’s ViPR allows file systems to be created across different storage arrays, but the data services are still within the storage array, and the data itself includes all the metadata.
The expectation from the Digital Universe Study is that more than 6 Zettabytes of data is stored in the universe in 2014, growing to 44 Zettabytes by 2020. The devices holding it range from smart phones and thumb drives to data centers, mobile clouds, enterprise clouds, and the Internet of Things. Governments have some of the largest problems. The general problem is knowing what data is stored where, when it was stored, and how it can be accessed. The heart of the problem is managing Big MetaData.
Primary Data Solution: True Data Virtualization
The key to Primary Data's approach is separating the metadata, the data about the data, from the data:
- All access to the data (from user, application or service) is initially through the control channel to the metadata that establishes data access channels to the data.
- Access to the data is protocol agnostic, and it does not matter where the actual data is stored or how it is accessed. The data can be direct attached, network attached, or can be in private or public clouds.
- Access to the data is also storage media agnostic. The data can be held on flash, traditional magnetic disk drives or magnetic tape.
- Access is also storage type agnostic. The data can be block, file, or object.
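The separation described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Primary Data's implementation: the names `MetadataCatalog`, `Location`, and `open_data_channel` are hypothetical, and the backends are plain in-memory dictionaries standing in for flash, disk, or cloud stores.

```python
from dataclasses import dataclass


@dataclass
class Location:
    """Where the bytes currently live (hypothetical fields)."""
    backend: str   # e.g. "flash", "disk", "cloud"
    key: str       # backend-specific address of the data


class MetadataCatalog:
    """Control channel: holds metadata only, never the data itself."""

    def __init__(self, backends):
        self.backends = backends      # backend name -> dict acting as a store
        self.entries = {}             # logical name -> Location

    def register(self, name, backend, key, payload):
        # Write the payload to the chosen backend, record only its location.
        self.backends[backend][key] = payload
        self.entries[name] = Location(backend, key)

    def open_data_channel(self, name):
        """Resolve a logical name and return a reader for the data path."""
        loc = self.entries[name]
        store = self.backends[loc.backend]
        return lambda: store[loc.key]


backends = {"flash": {}, "disk": {}, "cloud": {}}
catalog = MetadataCatalog(backends)
catalog.register("q3-report", "cloud", "bucket/obj-17", b"quarterly numbers")

# The consumer asks for a logical name; it never sees backend or key.
read = catalog.open_data_channel("q3-report")
data = read()
```

The consumer only ever handles the logical name; which backend answers the read is decided entirely by the metadata lookup.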
Add some data pumps to the mix to move data around dynamically, and you have true virtualization of data, where the consumer does not need to know anything about format, location, protocol, etc. Data can be moved to the appropriate layer programmatically, with far more information available to ensure the optimum balance of cost, performance, and time to first and/or last byte. Data can be placed in traditional storage arrays/filers or in public clouds according to cost and performance criteria. Initial searches for data can use extensive querying of the metadata to speed up searches and data retrieval. Global de-duplication metadata (particularly for copies of data) could reduce the storage required by factors of ten or more. The creation of Big MetaData allows multiple uses of data, e.g., business record, warehouse item, and archive.
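As a rough illustration of cost- and performance-driven placement, the sketch below picks the cheapest tier that meets a latency target and then "pumps" the data there while the logical name stays the same. The tier names, prices, and latencies are invented for the example, and `choose_tier` and `pump` are hypothetical helpers rather than anything Primary Data has published.

```python
# Hypothetical tiers: (name, cost in $ per GB-month, typical latency in ms).
TIERS = [
    ("flash", 1.00, 0.1),
    ("disk", 0.10, 10.0),
    ("cloud", 0.03, 100.0),
    ("tape", 0.01, 60000.0),
]


def choose_tier(max_latency_ms):
    """Return the cheapest tier whose latency meets the target."""
    eligible = [t for t in TIERS if t[2] <= max_latency_ms]
    if not eligible:
        raise ValueError("no tier meets the latency target")
    return min(eligible, key=lambda t: t[1])[0]


def pump(catalog, stores, name, target):
    """Move the bytes to the target tier and update only the metadata."""
    source = catalog[name]
    if source == target:
        return
    stores[target][name] = stores[source].pop(name)
    catalog[name] = target


stores = {tier[0]: {} for tier in TIERS}
catalog = {"q3-report": "flash"}          # logical name -> current tier
stores["flash"]["q3-report"] = b"quarterly numbers"

# The report has gone cold: a 200 ms access time is now acceptable.
pump(catalog, stores, "q3-report", choose_tier(max_latency_ms=200))
```

After the move, a consumer still asks for `q3-report`; only the catalog entry and the physical location have changed.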
Managing Big MetaData
The key to this approach is ultra-high availability of the metadata with very fast access. One enabler is flash storage, architected as an extension of DRAM (when at Fusion-io, David Flynn was probably more interested in achieving this software goal than in shipping boatloads of flash to Apple and Facebook), with ultra-fast networks between clustered local servers and fast connections to distributed clusters - think EMC DSSD (Distributed SSD) on steroids.
Primary Data claims to be using mainly open-source software (extending it where necessary and contributing back these extensions). The Big MetaData management layer is the key area where Primary Data is focussing development of proprietary IP.
Action Item: Primary Data has gone after the Big MetaData problem with a bold solution. The potential benefits for organizations with piles of data are enormous, both for data and cost reduction, and as an enabler of Big Data analytics, search, and active archiving. As with other large-scale solutions such as Cleversafe and DDN’s WOS, government agencies are very likely to be the initial implementers, using it to manage multiple sources of disparate data. CIOs and CTOs should keep a keen eye on Big MetaData management in general, and Primary Data in particular.