The demands of analytics-driven real-time decision making, which require millions of compute decisions every few seconds and near-instantaneous access to multiple terabytes of data, are transforming the infrastructure that supports these high transaction volume applications.
While advances in multi-core server technology and the advent of NoSQL databases are both key components of the overall solution, the commercialization of high-capacity, low-cost SSD or flash storage arrays makes possible a solution that cannot easily or effectively be achieved with traditional hard-disk drives (HDDs) or even an all-DRAM in-memory alternative.
During Wikibon’s November 27, 2012 Peer Incite call, Dag Liodden, Co-Founder and CTO of Tapad, shared his firm’s rationale for selecting flash as the primary data storage asset for the company's high-performance cross-platform ad solution, which offers customers such as Dell online ad retargeting services across multiple devices, browsers and mobile apps. (Follow this link for Wikibon Big Data Analyst Jeff Kelly’s overview of Tapad.)
Why Flash Works Best
According to Liodden, latency and access patterns dictate the type of storage needed for Tapad's application to read and write every device key value and to support specific types of querying and updating of user IDs. Tapad can store kilobyte-scale key values in RAM. However, it has collected information on billions of devices, and that data, including replicated backups, has grown to exceed 3.5 terabytes. In addition, access patterns are random, so the system must assume that all the data is “hot” and provide real-time access to the entire data set.
Liodden says that loading almost 2 terabytes of active data into an in-memory solution would not only be cost-prohibitive but would also require several more servers, and such a solution lacks inherent caching heuristics for Tapad's ad exchange model. “Using an all RAM configuration might be faster than a RAM and flash approach but it doesn’t afford us the predictability and lower cost of keeping all the data on flash. Plus, if a server or node goes down, it takes much more time to perform a RAM boot and load that data back into memory. Latency per request is not as predictable with RAM, and you can’t predict hot spots. The RAM bus speed may also be a bottleneck.”
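The server-count argument can be illustrated with a rough sizing sketch. Only the 3.5-terabyte data set size comes from the discussion above; the per-node RAM and flash capacities are hypothetical values chosen for illustration, not Tapad's actual hardware.

```python
import math


def nodes_needed(dataset_tb: float, usable_tb_per_node: float) -> int:
    """Return the smallest node count whose combined capacity holds the data set."""
    if usable_tb_per_node <= 0:
        raise ValueError("usable_tb_per_node must be positive")
    return math.ceil(dataset_tb / usable_tb_per_node)


DATASET_TB = 3.5          # total data set, including replicas (from the article)
RAM_TB_PER_NODE = 0.125   # hypothetical: 128 GB of usable RAM per server
FLASH_TB_PER_NODE = 1.0   # hypothetical: 1 TB of usable flash per server

ram_nodes = nodes_needed(DATASET_TB, RAM_TB_PER_NODE)
flash_nodes = nodes_needed(DATASET_TB, FLASH_TB_PER_NODE)

print(f"all-RAM nodes: {ram_nodes}, flash-backed nodes: {flash_nodes}")
```

Under these assumed capacities, the all-RAM layout needs several times as many servers as the flash-backed one. The real ratio depends entirely on the hardware chosen.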
Liodden believes traditional HDD arrays still have uses for some Big Data or Hadoop and MapReduce applications that are more CPU-bound than I/O-intensive. “In many cases HDD or rotational drives have high bandwidth when reading sequential or linear data such as counting the number of unique users visiting a site or summing up the number of page views while sifting through massive data sets. For smaller data sets with random access patterns, SSD is much better.”
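The difference between sequential and random access can be sketched with a simple cost model: each request pays a fixed latency, and the bytes themselves are transferred at the device's bandwidth. The latency and bandwidth figures below are hypothetical round numbers, not measurements of any particular drive.

```python
import math


def read_time_seconds(total_bytes: int, request_bytes: int,
                      latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Estimate the time to read total_bytes in chunks of request_bytes.

    Every request pays latency_s (seek or access time); the data itself
    streams at bandwidth_bytes_per_s.
    """
    requests = math.ceil(total_bytes / request_bytes)
    return requests * latency_s + total_bytes / bandwidth_bytes_per_s


TOTAL = 10**9   # read 1 GB in total
SMALL = 4096    # random access: many 4 KB requests

# Hypothetical device parameters, for illustration only.
HDD = {"latency_s": 0.008, "bandwidth_bytes_per_s": 150e6}
SSD = {"latency_s": 0.0001, "bandwidth_bytes_per_s": 500e6}

hdd_sequential = read_time_seconds(TOTAL, TOTAL, **HDD)  # one long scan
hdd_random = read_time_seconds(TOTAL, SMALL, **HDD)      # one seek per 4 KB
ssd_random = read_time_seconds(TOTAL, SMALL, **SSD)

print(f"HDD sequential: {hdd_sequential:.1f} s")
print(f"HDD random:     {hdd_random:.1f} s")
print(f"SSD random:     {ssd_random:.1f} s")
```

In this model the HDD handles the sequential scan well because it pays the seek penalty once, while the random workload pays it on every request. That is the pattern Liodden describes.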
Tapad selected NoSQL database solution provider Aerospike to help support its ad exchange application. Liodden uses traditional relational databases for other in-house applications but found the NoSQL approach much more appropriate for integrating the real-time analytics and high-speed database reads and writes that his ad exchange environment needs. “The data set keeps growing, and our traditional relational database would need to be modified or clustered to handle the volume of data we have. It was just much easier and more straightforward to go with the Aerospike option.”
Aerospike also offers a Flash-Optimized Data Storage Layer, which reliably stores data in DRAM and flash, as well as software that supports wear-leveling on SSDs, a common concern in high transaction volume environments. By design, flash cells can endure only a limited number of program/erase cycles. However, Liodden says Tapad has been running its flash array in a production environment for more than 18 months and has yet to replace an SSD. “Given the original cost of the SSDs, they have already more than paid for themselves, and the replacement cost continues to go down.”
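The idea behind wear-leveling can be shown with a toy allocator that always writes to the least-erased block, so erase cycles spread evenly across the device. This is a minimal illustration of the general technique, not Aerospike's implementation; the class and method names are hypothetical.

```python
import heapq


class WearLevelingAllocator:
    """Toy allocator that directs each write to the least-erased block."""

    def __init__(self, num_blocks: int) -> None:
        # Min-heap of (erase_count, block_id); the least-worn block is on top.
        self._heap = [(0, block_id) for block_id in range(num_blocks)]
        heapq.heapify(self._heap)

    def write(self) -> int:
        """Pick the least-erased block, charge it one erase cycle, return its id."""
        erase_count, block_id = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (erase_count + 1, block_id))
        return block_id

    def erase_counts(self) -> dict:
        """Return a mapping of block_id to the erase cycles recorded so far."""
        return {block_id: count for count, block_id in self._heap}


allocator = WearLevelingAllocator(num_blocks=4)
for _ in range(10):
    allocator.write()

counts = allocator.erase_counts()
print(counts)
```

Because every write goes to the block with the fewest erases, no block's count ever runs more than one cycle ahead of another's in this sketch. Real flash translation layers must also account for data that is rarely rewritten.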
Bottom Line
Traditional HDD storage and relational databases are not going away, as they will retain their usefulness for many existing applications. However, for high-performance and real-time applications that rely on access to active data, deploying SSDs and flash storage arrays is quickly becoming the best practice. In many cases, flash is not only more cost-effective than DRAM in-memory solutions but also offers more predictability and resiliency.
Action Item: Application owners of real-time analytics and high transaction volume solutions that require access to terabyte-plus volumes of active data and must forgo sequential processes in favor of integrated processes to meet latency requirements need to consider flash storage arrays over traditional HDDs and even all-DRAM in-memory solutions. In addition, IT executives should consider deploying all metadata as active data on flash arrays to enable the repurposing of that metadata for new value creation.