On October 10, 2013, MarkLogic Corporation kicked off its customer and partner Summit Series in New York City with several exciting announcements related to the latest version of its industry-leading Enterprise NoSQL database platform, MarkLogic 7.
A press release distributed that morning to news outlets and the IT analyst community highlights major enhancements to MarkLogic 7, including capabilities for customers to reduce storage costs by leveraging cloud infrastructures and tiered storage, a native Hadoop interface to enable data to easily pass between the two systems, and a new semantics option that stores “RDF triples, documents and data in the same proven Enterprise NoSQL database, maintaining context and making the facts available for decision-making.”
In addition, MarkLogic adds new features that enable searchable storage tiers and data compression, additional search and query options and government-grade security options. In the press release, CEO Gary Bloom states, “The promise of the cloud and Hadoop can only be realized when coupled with a database platform that enables elasticity and flexibility with the assurance of data consistency and security.”
At the Summit, Bloom praised MarkLogic’s customers for their contributions. “These modifications and new capabilities included with our MarkLogic 7 release came about as a result of feedback and hard work from our loyal client base, some of whom have been with us for a decade.” Regarding semantics, Bloom predicts, “12 months from now, semantics is all that the NoSQL community will be talking about.”
300 Customers in the Early Access Program
More than 300 users representing 70 unique enterprise organizations, including customers, prospects, and partners, have been working with MarkLogic 7 through the Early Access Program. Semantics use cases were developed and tested across a range of industries from financial services, media, and publishing to government agencies, healthcare, and pharmaceutical companies.
During a panel discussion at the summit, customers applauded version 7, slated for general release in late 2013, for its semantics, tiered storage and elasticity features as well as MarkLogic’s commitment to improving ease of use. Two long-time customers talked about their experience with MarkLogic 7 and the motivations behind their shift to this latest version.
Sanjay Anand, a product engineer at Zynx Health, a division of Hearst Corporation and a market leader in providing evidence-based clinical decision support systems that help healthcare organizations improve patient outcomes and decrease costs, talked about the difficulty of updating thousands of documents in Zynx’s legacy relational database system. Although he was able to complete document updates with MarkLogic 6, with MarkLogic 7 the process is “seamless.”
Mike Bowers, principal architect at the Church of Jesus Christ of Latter-day Saints (LDS), has been using MarkLogic since Version 3 and said that “MarkLogic is way ahead of the game.” With 15 million members, and millions of unique visitors every month to the church’s website, Bowers knows the complications in managing Big Data. He believes the church’s analysts and MarkLogic programmers will benefit greatly from faster access to data on tiered storage, improved storage utilization through compression managed by MarkLogic, and the semantics capabilities.
Excitement for NoSQL
More than 50 NoSQL database engines are now on the market, some of which have their roots in open-source solutions going back two decades. According to a variety of sources, including Wikibon, MarkLogic is the market share leader in the combined Hadoop/NoSQL space. Bloom says, “We’ve got a five-year head start on all the other NoSQL solution providers.”
Very recently, funding for NoSQL solutions has accelerated, with competitor MongoDB announcing two weeks ago a $150-million funding round, which gives the company a valuation north of $1.2 billion. In aggregate, funding for the Hadoop/NoSQL space is approaching $1 billion, which bodes well for supporting growth and valuations across the NoSQL solutions landscape. When asked to comment, Bloom quipped, “They’ll [MongoDB] use the money to get to the level we’ve already achieved.”
Genesis of MarkLogic
MarkLogic is the brainchild of Chris Lindblad, who founded the company over a decade ago to address what he believed was a major gap in how databases handled unstructured data such as documents and other text formats. A former architect at Infoseek working on the Ultraseek enterprise search platform (now part of the Verity search owned by HP-Autonomy), Lindblad spoke at the summit about his vision for document-oriented databases and the major role XML, the most ubiquitous document format, plays in managing unstructured, mostly text-based data.
Lindblad says he realized that enterprises were going to be limited in what they could do with their text-based data unless someone figured out a different way to ingest, index, and store it. “Before Big Data was called Big Data, the MarkLogic team spotted the ‘publishing problem’ that plagues every large enterprise: Digitize content or die. That led the team to build the first-ever NoSQL database.”
MarkLogic Special Sauce
Today MarkLogic is well known throughout the big data space for merging capabilities inherent in traditional relational databases (RDBMSs) and SQL query languages with the agility and innovation found in the NoSQL solutions world. (NoSQL stands for Not only SQL.)
MarkLogic defines its solution as “a document-centric, transactional, search-centric, structure-aware, schema-agnostic, programmatic, high-performance, clustered, database server.” Utilizing a hierarchical data model, MarkLogic supports “any structured” data in compressed binary “trees.” Ostensibly, MarkLogic 7 handles virtually any kind of structured, unstructured or semi-structured data, from documents, image metadata and video to spreadsheet and financial data.
To manage inserts, updates, and concurrent reads, MarkLogic uses Multi-Version Concurrency Control (MVCC), appending changes and using timestamps to track the birth and death of a document. MVCC benefits include support for ACID transactions (critical to the financial industry), very fast updates, large sequential block writes (which give you fast ingestion and the ability to run on block storage), point-in-time recovery, fast database rollback, and lock-free reads.
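The birth and death timestamps described above can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not MarkLogic's implementation: the class and method names (`VersionedStore`, `write`, `delete`, `read`) are hypothetical, and a simple integer counter stands in for the system timestamp.

```python
from dataclasses import dataclass

INFINITY = float("inf")


@dataclass
class Version:
    uri: str
    content: str
    birth: int                 # timestamp at which this version became visible
    death: float = INFINITY    # timestamp at which it was superseded or deleted


class VersionedStore:
    """Toy MVCC store: updates append new versions instead of overwriting."""

    def __init__(self):
        self.clock = 0         # hypothetical stand-in for a system timestamp
        self.versions = []     # append-only list of Version records

    def _live(self, uri):
        # Find the version of this document that has not yet "died".
        for v in self.versions:
            if v.uri == uri and v.death == INFINITY:
                return v
        return None

    def write(self, uri, content):
        # Stamp the old version's death and append the new version.
        self.clock += 1
        old = self._live(uri)
        if old is not None:
            old.death = self.clock
        self.versions.append(Version(uri, content, self.clock))
        return self.clock

    def delete(self, uri):
        # A delete only records a death timestamp; content stays on disk.
        self.clock += 1
        old = self._live(uri)
        if old is not None:
            old.death = self.clock
        return self.clock

    def read(self, uri, at=None):
        # A reader picks a timestamp and sees the version alive at that time,
        # so it never has to lock out concurrent writers.
        ts = self.clock if at is None else at
        for v in self.versions:
            if v.uri == uri and v.birth <= ts < v.death:
                return v.content
        return None
```

Because old versions are kept until they are cleaned up, calling `read(uri, at=earlier_timestamp)` returns the document as it stood at that time, which is the idea behind point-in-time recovery and fast rollback.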
To maintain its high performance and skimp on storage space, MarkLogic stores XML document data in a highly efficient manner utilizing a binary coding scheme. A whitepaper entitled Inside MarkLogic Server explains the process in detail on page 33: “The tree structure of a document gets saved using a compact binary encoding. The text nodes get saved using a dictionary-based compression scheme. In this scheme, the text gets tokenized (into words, whitespace and punctuation) and each document constructs its own dictionary, mapping numeric token IDs to token values. Instead of storing strings as sequences of characters, each string gets stored as a sequence of numeric token IDs. The original string can be reconstructed using the dictionary as a lookup table.”
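The dictionary scheme quoted above can be sketched in a few lines of Python. This is a simplified illustration of the idea only: the function names and the tokenizing regular expression are assumptions, and the real on-disk format is a compact binary encoding that this sketch does not attempt to reproduce.

```python
import re

# Split text into runs of word characters, runs of whitespace,
# or single punctuation characters, so that no character is lost.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def encode(text):
    """Return (dictionary, token_ids) for one document's text."""
    dictionary = []       # position in this list is the numeric token ID
    ids_by_token = {}     # reverse lookup used only while encoding
    token_ids = []
    for token in TOKEN_RE.findall(text):
        if token not in ids_by_token:
            ids_by_token[token] = len(dictionary)
            dictionary.append(token)
        token_ids.append(ids_by_token[token])
    return dictionary, token_ids


def decode(dictionary, token_ids):
    """Rebuild the original string by using the dictionary as a lookup table."""
    return "".join(dictionary[i] for i in token_ids)
```

Repeated words and repeated whitespace are stored once in the dictionary and referenced by ID afterward, which is where the space saving comes from in text with a lot of repetition.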
New features and capabilities in MarkLogic 7 are built on top of a proven architecture that simultaneously supports multiple interfaces to industry standard development and query tools, a number of purpose-built indexes, connectors to popular applications and, in addition to XML, text-based open standards such as JSON.
Hadoop: The Elephant in the Cloud
As mentioned above, MarkLogic 7 adds support for cloud computing and now runs natively on top of the Hadoop Distributed File System (HDFS) as well as offering new features that enable searchable storage tiers and data compression.
These enhancements allow customers to do more with Hadoop environments, including easily and securely moving – or attaching and detaching – data between MarkLogic 7 and Hadoop clusters; reducing costly extract, transform and load (ETL) processes; and minimizing duplication of effort as well as allowing mission-critical apps to run directly on HDFS. MarkLogic specifically leverages the MapReduce part of the Hadoop stack (a popular tool for running Java-based, computationally intensive programs across a large number of nodes) to facilitate bulk processing of data.
The ability to support a variety of cloud implementations (private, public and hybrid) has become critical for customers with applications that need “elasticity” to support bursts or lulls in activity and also augment test and development environments. MarkLogic has created a new flagship offering, MarkLogic Global Enterprise, and in doing so has both lowered the price and made standard several features that were previously add-ons. MarkLogic has also introduced a new edition, MarkLogic Essential Enterprise, that is offered with Perpetual, Term and Cloud licensing. This edition can handle all but the largest, most complex applications, and is priced at one-third the cost of MarkLogic’s previous Enterprise Server edition. In addition, MarkLogic offers customers the choice of perpetual, subscription or pay-as-you-go pricing for an enterprise-ready database.
Fast or Cheap, Tiered and Compressed Storage
MarkLogic 7 has added a number of storage management features that allow customers to maximize their data storage assets without imposing burdens on developers or users. MarkLogic 7 offers IT operations the ability to define data tiers based on a “range” index. Categories include Active (a high-performance tier that can utilize SSDs or Flash storage), Less Active (lower cost, lower performance, hard disk drive storage), Historical (for online archived data) and Archive (offline archive). Range indexes are also used to support faster queries and faceted searches.
A system administrator can even create a range index against any field, enabling “sort by, constrain by, and extract values not even directly present in the documents” (from Inside MarkLogic). Moreover, MarkLogic allows customers to dynamically move data along multiple tiers based on user-defined categories. Anything you can search, you can tier off.
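To make the idea of range-based tiering concrete, here is a minimal Python sketch. It assumes documents are tiered by a date value held in a sorted range index. The tier boundaries, the `RangeIndex` class and the helper functions are all hypothetical and are not MarkLogic APIs; only the four tier names come from the description above.

```python
import bisect
from datetime import date

# Hypothetical tier boundaries (lower bound of each tier, newest first).
TIERS = [
    (date(2013, 1, 1), "Active"),
    (date(2011, 1, 1), "Less Active"),
    (date(2005, 1, 1), "Historical"),
]
DEFAULT_TIER = "Archive"


def tier_for(value):
    """Pick the first tier whose lower bound the value reaches."""
    for lower_bound, tier in TIERS:
        if value >= lower_bound:
            return tier
    return DEFAULT_TIER


class RangeIndex:
    """Sorted (value, uri) pairs, so range lookups avoid scanning every document."""

    def __init__(self):
        self._entries = []

    def add(self, value, uri):
        bisect.insort(self._entries, (value, uri))

    def between(self, low, high):
        """URIs whose value v satisfies low <= v < high, in value order."""
        start = bisect.bisect_left(self._entries, (low,))
        end = bisect.bisect_left(self._entries, (high,))
        return [uri for _, uri in self._entries[start:end]]


def assign_tiers(docs):
    """Group document URIs by tier, given a {uri: date} mapping."""
    plan = {}
    for uri, value in docs.items():
        plan.setdefault(tier_for(value), []).append(uri)
    return plan
```

The same sorted structure serves both purposes mentioned in the text: `between` answers a range query quickly, and `assign_tiers` shows how a value in the index could decide where a document lives.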
In addition, MarkLogic Tiering provides multiple service level agreements (SLAs) within a single system (multi-tenancy), decreases the time and cost of ETL to bring offline content (in Hadoop) back online, allows for an entire tier to be moved to another storage asset and offers the ability to query a single tier or multiple tiers at once – all with no downtime and 100% data consistency. System or programmatic upgrades or changes on-premise or in the cloud, such as adding nodes or upgrading to new software releases can also be done without taking the system down.
Finally, by updating its compression algorithms, including its proprietary binary coding scheme that reduces data and index size, MarkLogic estimates that version 7 cuts storage utilization by 30% to 66% compared with version 6, not to mention how much more efficient it is than competitive database solutions.
The Semantic Triple Jump
MarkLogic 7 offers semantic technologies for organizations looking to gain additional insights by allowing their users to leverage facts from many sources – the document itself, the domain and the world. Semantics look at the context of content to help the user get the facts associated with their data. Data + context = information.
The MarkLogic 7 triple store manages and indexes a collection of “facts” expressed in semantic triples and, according to MarkLogic, can efficiently query and join billions of “linked data” triples. A triple consists of a Subject (Harry), a Predicate (lives in) and an Object (New York).
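A triple store can be pictured as a set of (subject, predicate, object) tuples with an index on each position. The Python sketch below is a toy model only. The `TripleStore` class and its `match` method are hypothetical names and say nothing about how MarkLogic's triple index is actually built.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory triple store with one index per triple position."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)
        self.by_predicate = defaultdict(set)
        self.by_object = defaultdict(set)

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        self.triples.add(triple)
        self.by_subject[subject].add(triple)
        self.by_predicate[predicate].add(triple)
        self.by_object[obj].add(triple)

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        candidates = self.triples
        if subject is not None:
            candidates = candidates & self.by_subject.get(subject, set())
        if predicate is not None:
            candidates = candidates & self.by_predicate.get(predicate, set())
        if obj is not None:
            candidates = candidates & self.by_object.get(obj, set())
        return sorted(candidates)
```

With the example triple from the text loaded via `add("Harry", "lives in", "New York")`, a call such as `match(predicate="lives in", obj="New York")` returns every fact about who lives in New York.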
With semantics, customers can set rules that help to surface relationships in order to learn something that is not represented anywhere in their data. Linked data can tell the user something about their data even if they weren’t looking for it.
Sites like the BBC find semantics particularly useful. With thousands of pages on thousands of different topics, the BBC needed an automated, dynamic way of updating these pages. For example, if a new story appears about Lampard (a footballer who plays for Chelsea), the system knows that this is relevant to the Chelsea page on BBC as well. MarkLogic 7 offers the potential to simplify the BBC’s architecture by managing semantics and media assets in the same place.
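The Lampard example amounts to a simple rule applied over triples: if a story mentions a player, and that player plays for a team, then the story is relevant to that team. The Python sketch below illustrates such a rule under stated assumptions. The predicate names ("mentions", "plays for", "relevant to") and the story identifier used in the usage note are invented for illustration and do not describe the BBC's or MarkLogic's actual vocabulary.

```python
def infer_relevance(triples):
    """Derive (story, 'relevant to', team) facts that are not stored explicitly."""
    # First pass: collect which teams each player plays for.
    plays_for = {}
    for subject, predicate, obj in triples:
        if predicate == "plays for":
            plays_for.setdefault(subject, set()).add(obj)

    # Second pass: join each "mentions" fact with the player's teams.
    inferred = set()
    for subject, predicate, obj in triples:
        if predicate == "mentions":
            for team in plays_for.get(obj, ()):
                fact = (subject, "relevant to", team)
                if fact not in triples:
                    inferred.add(fact)
    return inferred
```

Given the two stored facts `("Lampard", "plays for", "Chelsea")` and `("story-1", "mentions", "Lampard")`, the function derives that `story-1` is relevant to Chelsea, a fact that appears nowhere in the input.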
According to the press release, “MarkLogic Semantics, with a specialized triple index, enables industry-standard SPARQL queries [sometimes referred to as SQL for semantics] combined with queries against documents and data, so that all relevant information can be delivered in applications and analytic reports.”
Conclusion and Analysis
MarkLogic 7 has added many advanced features and functions, including more robust storage management functionality, a native Hadoop interface, improved elasticity and semantics capabilities that will undoubtedly satisfy existing clients and will likely attract a bevy of new clients.
MarkLogic also has a market share and feature/functionality advantage over most of its competition, of which there is plenty. Many NoSQL vendors now claim to offer ACID transactions and advanced security capabilities, while others are focused on the mobile, in-memory, graph or distributed DB opportunities within a dynamic, yet fragmented market.
Perhaps the greatest challenge for MarkLogic will be whether MarkLogic 7 and subsequent versions can do more than co-exist, as they do today, with the entrenched relational database solutions from vendors such as Oracle, IBM and Microsoft (a $25-billion market), and instead replace those solutions along with their legacy SQL tools and services organizations.
Meanwhile, MarkLogic is helping its customers solve real-world problems that could not easily or efficiently have been solved by traditional relational database management solution providers, as traditional RDBMS scale-out architecture does not lend itself to cloud-scale, highly available, document-oriented and distributed database workloads.
The race is on to see which of the 50 or so NoSQL database solution providers will win a significant portion of the hearts and minds of application owners, CIOs and business users. With MarkLogic 7, MarkLogic once again raises the bar for the entire DB industry, jumping to the head of the NoSQL class with game-changing, enterprise-ready capabilities.
Action Item: There are signs that NoSQL solutions are disrupting legacy RDBMS opportunities for new database application business, if not yet replacing existing databases. Speed to market, innovation, availability, scale-out, ease of use and cost are all disruptive factors for NoSQL vs. RDBMS. CIOs and IT executives must review their application portfolios, especially Web-based transactional systems and Big Data-related analytics projects, to determine if NoSQL solutions are the best database fit for these specialized solutions.
Footnotes: Click on the following link to view a complimentary report by Gary MacFadden covering all the key updates and features of MarkLogic 7.