What is AWS Neptune?

The AWS Neptune graph database is designed to store a wide collection of complex relationships as a scalable service. It supports a number of different and evolving standards for representing knowledge and complex networks as graphs and has recently added hooks for a Graph Store Protocol, openCypher, Neptune ML, and TinkerPop Gremlin to its wide array of supported APIs.

Running on the AWS cloud, it is an important new member of the increasingly competitive field of graph databases. Notably, Amazon is focusing on integrating AI routines from SageMaker, the company’s AI service, into AWS Neptune. The goal is a hybrid tool that both stores and analyzes data.

Graph databases store large collections of relationships between objects, people, ideas or any other entity that might be represented in a database. While relational databases do well with recording fields of data and one-to-many connections, graph databases are optimized to track many-to-many relationships, like social networks (who knows who) and concept networks (which ideas are connected to which others).

Some of the natural use cases for graph databases like Neptune are:

Fraud detection — Criminal behavior often falls into a predictable pattern, and graph databases are useful for finding patterns based on connections between events. A series of bad events using the same physical or IP address, for example, could lead to flagging future events with the same addresses for scrutiny.
Recommendation engines — If the graph can link similar items, a simple algorithm can offer users help finding new friends or potential purchases by following these links.
Knowledge graphs — One of the more sophisticated options is to create a network of relationships between abstract ideas, thoughts, and concepts. This can act as the foundation for more sophisticated search algorithms, language translation, or other forms of artificial intelligence.
Money laundering monitors — Some regulations ask financial institutions to track the flow of currency to help prevent crime. Graph databases are natural options for modeling transactions and detecting net flows.
Contact tracing — Epidemiologists often work to control the spread of disease by tracking how and when people meet and interact. Graph databases often have algorithms for tracing the flow through multiple hops.
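
To make the multi-hop idea concrete, here is a minimal sketch in plain Python of the kind of traversal a contact-tracing query performs. It is not Neptune code. The function name contacts_within and the meetings data are made up for illustration, and a graph database would run the equivalent traversal inside its own query engine.

```python
from collections import deque

def contacts_within(graph, start, max_hops):
    """Return {person: hops} for everyone reachable from start in at most max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand past the hop limit
        for other in graph.get(person, ()):
            if other not in hops:  # first visit is the shortest path in a BFS
                hops[other] = hops[person] + 1
                queue.append(other)
    del hops[start]  # the starting person is not their own contact
    return hops

# Hypothetical contact data: each person maps to the people they met.
meetings = {
    "ana": ["ben", "cy"],
    "ben": ["ana", "dee"],
    "cy": ["ana"],
    "dee": ["ben", "eli"],
    "eli": ["dee"],
}

exposed = contacts_within(meetings, "ana", max_hops=2)
print(exposed)
```

With a limit of two hops, the sketch reaches ben and cy directly and dee through ben, while eli stays out of range. The same breadth-first pattern underlies friend-of-friend recommendations and following funds through a chain of accounts.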

Neptune supports the two major conceptual models for graph data processing (property graph and RDF) and the various query languages for each of them. Users can choose a particular model when creating the database, but the two models are not easily interchangeable after creation.
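
The difference between the two models is easier to see side by side. The sketch below stores the same small fact ("Alice knows Bob") both ways using ordinary Python structures. The layout, the example.org namespace, and the helper names are assumptions made for illustration and do not reflect how Neptune stores data internally.

```python
# Property graph: vertices and edges are records that carry their own properties.
property_graph = {
    "vertices": {
        "v1": {"label": "person", "name": "Alice"},
        "v2": {"label": "person", "name": "Bob"},
    },
    "edges": [
        # The edge itself can hold a property such as "since".
        {"from": "v1", "to": "v2", "label": "knows", "since": 2019},
    ],
}

# RDF: every statement is a (subject, predicate, object) triple.
EX = "http://example.org/"  # hypothetical namespace
rdf_triples = {
    (EX + "alice", EX + "name", "Alice"),
    (EX + "bob", EX + "name", "Bob"),
    (EX + "alice", EX + "knows", EX + "bob"),
    # A plain triple has no slot for the edge property "since", so it is left out.
}

def known_by_pg(graph, name):
    """Names of people that `name` knows, read from the property graph."""
    ids = {vid for vid, v in graph["vertices"].items() if v["name"] == name}
    return sorted(
        graph["vertices"][e["to"]]["name"]
        for e in graph["edges"]
        if e["label"] == "knows" and e["from"] in ids
    )

def known_by_rdf(triples, name):
    """Names of people that `name` knows, read from the triples."""
    subjects = {s for s, p, o in triples if p == EX + "name" and o == name}
    friends = {o for s, p, o in triples if p == EX + "knows" and s in subjects}
    return sorted(o for s, p, o in triples if p == EX + "name" and s in friends)

print(known_by_pg(property_graph, "Alice"), known_by_rdf(rdf_triples, "Alice"))
```

Both lookups return the same answer, but the shapes differ enough that moving data from one model to the other means remapping it, which is one reason the choice is hard to reverse.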

Developers have a number of options for working with Neptune. Data can be inserted or queried with any of these protocols:

Gremlin, for accessing property graph data, from the Apache TinkerPop project
openCypher, another option for querying property graph data, originally from Neo4j
SPARQL, for searching RDF data, from the W3C
Bolt, a binary connection protocol for sending openCypher queries, from Neo4j
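
As a rough illustration, the sketch below writes one question ("whom does Alice know?") in each of the three query languages and builds, without sending, the HTTP request for each. It assumes Neptune's HTTPS endpoints at /gremlin, /openCypher, and /sparql on the default port 8182. The host name, the person and knows labels, and the example.org namespace are placeholders, and the build_request helper is hypothetical. Check the request formats against the current Neptune documentation before relying on them.

```python
import json
from urllib.parse import urlencode

# The same question in three languages; labels and namespace are made up.
QUERIES = {
    "gremlin": "g.V().has('person', 'name', 'Alice').out('knows').values('name')",
    "openCypher": "MATCH (a:person {name: 'Alice'})-[:knows]->(b:person) RETURN b.name",
    "sparql": (
        "PREFIX ex: <http://example.org/> "
        'SELECT ?name WHERE { ?a ex:name "Alice" . ?a ex:knows ?b . ?b ex:name ?name }'
    ),
}

def build_request(host, language, port=8182):
    """Return (url, body) for an HTTP POST; nothing is sent over the network."""
    url = f"https://{host}:{port}/{language}"
    query = QUERIES[language]
    if language == "gremlin":
        body = json.dumps({"gremlin": query})  # Gremlin is posted as a JSON document
    else:
        body = urlencode({"query": query})  # openCypher and SPARQL use a form field
    return url, body

for language in QUERIES:
    url, body = build_request("your-neptune-endpoint", language)
    print(url)
    print(body)
```

Bolt is left out because it is a binary protocol that needs a driver rather than a plain HTTP request.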

AWS Neptune is also designed like other Amazon databases to hide much of the complexity of installing the software or scaling it effectively. The service will replicate data to create read replicas across datacenters and availability zones. Backups can be triggered automatically to S3 buckets. If any node fails, other replicas can take over automatically.

Neptune pricing depends heavily on usage. The bill rolls together computing power ($0.098 per virtual machine hour and up), the amount of storage ($0.10 per GB-month), and the number of queries ($0.20 per 1 million requests). Backup storage is cheaper, at $0.02 per GB-month in the US East region. Some data transfer is free, but after the first terabyte the price starts at $0.09 per GB and drops with volume.
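
For a sense of how these rates combine, here is a back-of-the-envelope estimator using only the figures quoted above. The workload numbers (one instance running 730 hours, 100 GB stored, 50 million requests, 100 GB of backups) are invented for illustration. Data transfer is left out, and real bills depend on region, instance type, and current prices.

```python
# Rates as quoted in the article; actual prices vary by region and over time.
INSTANCE_PER_HOUR = 0.098      # smallest instance, per hour
STORAGE_PER_GB_MONTH = 0.10
PER_MILLION_REQUESTS = 0.20
BACKUP_PER_GB_MONTH = 0.02

def estimate_monthly_cost(instance_hours, storage_gb, requests, backup_gb=0.0):
    """Rough monthly estimate in US dollars, excluding data transfer."""
    compute = instance_hours * INSTANCE_PER_HOUR
    storage = storage_gb * STORAGE_PER_GB_MONTH
    queries = requests / 1_000_000 * PER_MILLION_REQUESTS
    backups = backup_gb * BACKUP_PER_GB_MONTH
    return round(compute + storage + queries + backups, 2)

# Hypothetical workload: one small instance running all month.
estimate = estimate_monthly_cost(
    instance_hours=730, storage_gb=100, requests=50_000_000, backup_gb=100
)
print(f"${estimate:.2f}")
```

For this made-up workload the total comes to about $93.54, of which roughly $71.54 is the instance itself, so compute dominates small deployments.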

The integration with Amazon’s SageMaker offers the opportunity to let the machine learning tool classify graph nodes and edges according to their attributes and the attributes of nodes or edges connected to them. It can also determine the most likely connections based on a dataset, allowing it to offer predictive paths.
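
To show what "most likely connections" can mean in practice, the sketch below ranks missing edges with a simple common-neighbors heuristic. This is a deliberately simple stand-in and not the method Neptune ML or SageMaker uses. The predict_links function and the friends data are hypothetical.

```python
from itertools import combinations

def predict_links(graph, top_n=3):
    """Rank unconnected pairs by how many neighbors they share."""
    scores = []
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already connected, nothing to predict
        shared = graph[a] & graph[b]
        if shared:
            scores.append((len(shared), a, b))
    # Highest score first; ties are broken alphabetically for stable output.
    scores.sort(key=lambda s: (-s[0], s[1], s[2]))
    return [(a, b, n) for n, a, b in scores[:top_n]]

# Hypothetical undirected friendship graph stored as neighbor sets.
friends = {
    "ana": {"ben", "cy"},
    "ben": {"ana", "cy", "dee"},
    "cy": {"ana", "ben", "dee"},
    "dee": {"ben", "cy", "eli"},
    "eli": {"dee"},
}

predictions = predict_links(friends)
print(predictions)
```

Here ana and dee rank first because they share two neighbors. A learned model would weigh node and edge attributes as well as structure, but the output has the same shape: a scored list of edges that do not exist yet.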

Some applications of this machine learning option include tasks from the physical world, like finding routes or paths through geographic data that’s been turned into a graph model. Other, more abstract tasks — like knowledge synthesis — depend on graph models built from text or conceptual networks.

How are established firms competing?

Older database vendors are adding graph capabilities to their existing products as another type of table. Oracle’s solution can model either property graph or RDF data under the umbrella of its main database. The company added graph searching capabilities to its query language and created a collection of tools, like Graph Studio, that make it easier to extend existing datasets to use the graph capabilities.

Microsoft added property graph modeling capabilities to the Azure Cosmos DB service. Queries can be built using Gremlin to search the nodes that are automatically replicated. The company has also added node and graph objects to SQL Server, making it possible to store graph information alongside other relational data.

IBM added the Apache TinkerPop analytics framework to Db2 so queries written in Gremlin can work alongside more standard SQL requests.

How are the upstarts competing?

Founded in 2007, Neo4j is one of the leading graph database companies and is responsible for developing some of the standards Neptune is emulating. It develops the Neo4j database, one of the first successful graph databases. The company has grown steadily and recently raised a round of funding at a $2 billion valuation, making it far from a startup but not in the same range as the biggest companies in the space.

In interviews, Neo4j’s leadership team cites the company’s moderate size as an advantage because it focuses on building the best graph database ecosystem, rather than dabbling in every technology. The tool is also easily downloaded, allowing companies to run it both in the cloud and on-premises. The software can run locally, in a preconfigured image on the major clouds, or in Neo4j’s proprietary Aura cloud.

A few other graph databases continue to grow. ArangoDB offers an enterprise version that can run on your own machines or as a preconfigured instance in the major clouds. A community version without some of the features for supporting large, multi-machine clusters is also available for those who want access to the source code. ArangoDB bills itself as “multi-model” because nodes can act like NoSQL key/value stores, parts of a graph, or both.

TigerGraph is also designed to tackle big datasets and can be used either on local hardware or through a subscription to TigerGraph Cloud. It can work with larger datasets using parts of the Apache Hadoop or Spark ecosystems. Queries are written in GSQL.

Dgraph is a distributed graph database available either with the Apache license or with a set of proprietary enterprise-grade layers for creating larger, multi-machine clusters. The main query language is GraphQL, created by Facebook.

JanusGraph is a project of the Linux Foundation supported by a number of companies, including Target. The database is designed to work with some of the big NoSQL databases, like Apache HBase, Google’s Bigtable, and Oracle’s Berkeley DB. Analysis of the data can be done via some distributed MapReduce frameworks or Apache Spark.

Is there anything AWS Neptune can’t do?

Support for both property graph and RDF gives Neptune broad appeal for many projects, including those that will use both architectures. But the support is not complete, and Neptune doesn’t offer all of the features in the various standards. For example, inference queries for RDF data aren’t available yet, reportedly because they slowed performance.

Available solely as a cloud service, AWS Neptune also differs from AWS offerings like Aurora because the core software is not available as an open source distribution, and developers can’t run local versions or move off of AWS hardware.
