Feature Graph Databases

How graph databases work

Relationships

Graph databases excel at revealing the relationships among different but related sets of data. By Jens-Christoph Brendel

Typical relational databases map relations indirectly by means of joins, whereas graph databases map the same relations directly. One example of the use of a graph database involves the so-called Panama Papers. Tax fraud, money laundering, breaches of sanctions, and a number of other criminal offenses were levied against clients of the Panamanian offshore service provider Mossack Fonseca. In 2016, journalists associated with the International Consortium of Investigative Journalists (ICIJ) from 107 newspapers, television stations, and online media in 80 countries uncovered the crimes by jointly analyzing some 11.5 million leaked email messages, database entries, letters, fax documents, deeds of incorporation, loan agreements, bills, and bank statements with the help of a graph database.

Relations

Alice likes Bob who follows Carol and Dave on Twitter. Streetcar line 15 runs from Max-Weber-Platz via Giesing to Gr¸nwald. Mr. Huber is the superior of Mr. Meyer and Mrs. Schulze. Theresa liked the video that Mike recommended. Vehicle is the generic term for car, motorcycle, and truck. Modern man is related to early hominids and other species of great apes.

All these statements have one thing in common: They are about relations – about connections between people or things for which you could intuitively sketch something tree-like on a piece of paper. Such relationships are becoming more important in a world that is becoming increasingly networked, where everything is connected with everyone.

It is certainly no coincidence that relations are at the heart of the business ideas of corporations that are now among the most valuable and powerful in the world – Facebook and Google, for example. In the first case, it's about relationships between people who know each other; in the second, it's about relationships between websites that link to each other. The often-quoted page ranking is the measure of these connections. Networking is the common denominator between these models, and relations are its underpinnings. This is also true for countless other applications where relations play the main role, be it in purchase recommendations, dating portals, fraudulent transaction detection, genetics, logistics, and so on and so forth.

Of all data structures, one is particularly well suited to representing relationships – the graph. Graph databases, in turn, can store graphs, expand or shrink them, update and rearrange them, and answer questions about their content. Other developments have certainly attracted more public attention, but, out of the spotlight somewhat, the trend curve of the "graph database" search term on Google has been rising steadily for the past 10 years.

In parallel, a great variety of graph databases have emerged, including the originals like Neo4j [1], distributed databases like Titan [2], free databases like JanusGraph [3], proprietary databases like Amazon Neptune [4], and established relational or NoSQL databases that have added the ability to handle graphs. An example of the latter is the DataStax Enterprise Graph [5] database, which combines the open source Cassandra [6] database with the also free Apache TinkerPop [7] graph computing framework to create a successful product.

Today, graph databases are in production use in almost all industries in leading global corporations, including – to quote just a few names from customer lists – NASA, Walmart, Siemens, Google, Samsung, eBay, IBM, Airbnb, Novartis, Volvo, and many others. If you want to look into graph databases, a short crash course in graph theory will certainly do no harm (see the "Graph Theory" box).

Graph Theory

This branch of mathematics was founded almost 300 years ago by the Swiss mathematician Leonhard Euler. He found a solution to the Königsberg (now Kaliningrad) bridge problem, which considered the question as to whether one can tour the city by crossing each of the seven bridges over the Pregel River exactly once. Euler proved that this was impossible. The potential routes through the city form a graph – even if Euler did not call it that at that time (Figure 1).

Figure 1: The Königsberg bridge problem: Can you cross all bridges precisely once? Euler proved that no solution exists.

Generally speaking, a graph is a structure of vertices and edges that connect nodes. Applied to information processing, the vertices always represent an element (person, thing, place, date, organization, category, etc.), and the edges represent the relationships between two vertices. The edges can be directional, in which case they are referred to as arcs, and the graph with these directed edges is called a digraph. The edges also can have assigned attributes (e.g., path lengths or costs), in which case it is referred to as a weighted graph. The same graph can often be represented in different ways, and the equivalent representations are referred to as being isomorphic (Figure 2).

Figure 2: Although not obvious at first glance, these isomorphic graphs are equivalent representations.

If a representation has edges that only cross the nodes, the graph is planar. An application in architecture states that if the paths between rooms form a planar graph, the rooms can be on one floor; otherwise, they would have to be spread over several floors, and you would need stairs.

The number of edges that meet in a node is known as the degree of the node. Euler found that you can find a way through all nodes in which no edge repeats (an Euler circuit) if, and only if, the degree of each node is an even number, which is not the case in the Königsberg bridge graph. The mean value of all degrees of the nodes of a graph is the branching factor. If all legal moves that occur in the course of a chess game are shown as a graph, the branching factor is between 31 and 35. The chess player can therefore choose between, on average, 31 and 35 possible rule-based moves at any point in the game. The branching factor in Go is 250, which is why, for example, it took much longer for computers to outperform humans in Go.

A common optimization problem is to modify a graph with the least amount of intervention possible in such a way as to create an Euler circuit (Eulerization), wherein the degree of each node must be even. A circuit that visits every vertex exactly once and starts and ends at the same vertex is known as a Hamiltonian circuit, whereas a Hamiltonian path visits every vertex exactly once, but it does not have to start and end at the same place. Finding a Hamiltonian path is relevant for all pick-up and delivery routes, for example.

In short, a finite sequence of the type node-edge-node-… is a path. If the starting point is equal to the end point, the path is closed; otherwise, it is open. A closed path with different vertices and edges is known as a circuit. A simple graph without circuits is a tree. Visiting each vertex and edge in a graph is a traversal; a tree traversal is a special case of the graph traversal.

A sample application of the paths concept can be found in project management, whose processes can be represented as a directed graph, in which the edges are usually weighted with a time duration. A route that contains the longest chain of interdependent steps determines the minimum project duration and is known as the critical path; the corresponding scheduling method is the critical path method (CPM). Any delays on this path have a direct effect on the overall project duration.

RDF and LPG

Graphs can represent many situations: computer networks, subway maps, organizational charts, flowcharts, molecular structures, decision paths, circuit paths, social networks, family trees, knowledge content, supply chains, and much more. Computer science uses several coding models, including the resource description framework (RDF) model [8] and the labeled property graph (LPG) model introduced by the Neo4j database.

The RDF model comes from the semantic web environment, which aims to enrich data with contextual information. RDF describes graphs as triples with the form subject-predicate-object. The first part of each triple is the subject node, followed by the predicate, which represents the edge and describes the nature of the relationship in more detail. At the end is the object, which can be a simple literal value. An example that could show a simplified section of a relationship network on Twitter might look like this:

Hans – follows – Werner

Werner – follows – Hans

Thomas – follows – Hans

In reality, however, the subject, predicate, and object in the RDF model are unique Unified Resource Identifiers (URIs), which are needed to mix information from different sources. (The universally known URLs are a special form of URI for the designation of Internet pages). The elements of an RDF triple, the URIs, are plain references – or simple literals in the case of some objects – that have no internal structure. One advantage is their uniqueness.

A bibliographic reference to the website http://www.example.org/index.html stating the author, date of creation, and language could look like Listing 1 in the RDF model. In the example, the URI http://purl.org/dc... conforms to the Dublin Core metadata standard.

Listing 1: RDF Triple Examples

<http://www.example.org/index.html>
<http://purl.org/dc/elements/1.1/creator>
<http://www.example.org/staffid/94640>
.
<http://www.example.org/index.html>
<http://www.example.org/terms/creation-date>
"May 10, 2019"
.
<http://www.example.org/index.html>
<http://purl.org/dc/elements/1.1/language>
"de"
.

LPG is slightly different, in that the nodes and edges have an internal structure. Each node and edge is its own small key-value store that stores its attributes. Different types of nodes or edges can contain a different number of these key-value pairs. Because the attributes are not extra nodes, unlike the RDF model, the representation is more compact. Additionally, individual properties of relationships and multiple relationships of the same type between subject and object can be more easily mapped in LPG format. However, the meaning of the freely definable key-value pairs attached to the nodes and edges is not defined uniquely in the same way as in the RDF model with URIs. A sample of a schema for storing Twitter tweets in LPG would look something like Figure 3.

Figure 3: A possible schema of a Twitter tweet graph in an LPG model.

Graph Databases

Some graph databases use the RDF model and others use the LPG model. Additionally, databases can be distinguished according to whether they are native graph databases that use a storage engine designed specifically for graphs or non-native graph databases that serialize the graphs in some form and then store them with a proven relational or NoSQL storage engine. Both have advantages and disadvantages. For example, native databases are usually faster, whereas non-native databases rely on a long known and well understood and optimized technique.

Query languages are another distinguishing feature. Unlike the relational world, no single standard such as SQL has been established; rather, a number of different query languages have emerged, including the widely used Cypher, Gremlin (as used in this article), SPARQL, GraphQL, and others. Although graph databases are mainly used for online transaction processing (OLTP), graph compute engines process large batch jobs with graph data (online analytical processing, or OLAP).

Both techniques are best suited for scenarios in which the object is less about retrieving discrete facts than related things, or relationships. Anything that can be represented as a network or tree is a meaningful task for graph databases, which includes anything that has to do with finding patterns or hotspots in a network. Such cases can only be represented inefficiently and with poor performance in the relational model because the relationships have to be represented by joins and special tables and cannot have any special properties.

First Steps

The Gremlin Console is one easy way to experiment with a graph database. Gremlin is the query language for Apache TinkerPop, an abstraction layer for multiple graph databases and processors that uses the LPG model.

Gremlin supports declarative pattern matching, which means that what is described is the problem, not the how or the solution. It is also path-based, meaning that queries follow a chain of nodes. Among others, Neo4j, OrientDB, DataStax Enterprise, Hadoop, and Amazon Neptune can use the Gremlin query language – sometimes in parallel with other query languages, which sometimes (e.g., SPARQL) can also be translated into Gremlin code.

Installing the console requires only downloading and unpacking the current version [9] on a Linux computer. If Java 8 is also already installed, the console is immediately ready to launch. A database – the Apache TinkerGraph in-memory model – is also included as a plugin. A selection of sample data is even included in the package.

To start the console from the directory created during unpacking, type bin/gremlin.sh. A very small sample database (Figure 4) can be created with the command from the first line of Listing 2, which loads the Modern database from the subdirectory data/, so named because it is a more modern version of an older example with six nodes and six edges.

Structure of the Modern database included in the Gremlin Console package. — Figure 4: Structure of the *Modern* database included in the Gremlin Console package.

Listing 2: Gremlin Database Example

01 graph = TinkerFactory.createModern()
02 ==>tinkergraph[vertices:6 edges:6]
03
04 gremlin> g = graph.traversal()
05 ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
06
07 gremlin> g.V().label().groupCount()
08 ==>[software:2,person:4]
09
10 # Names of all persons:
11 gremlin> g.V().hasLabel('person').values('name')
12 ==>marko
13 ==>vadas
14 ==>josh
15 ==>peter
16
17 # Name of person with ID 1
18 gremlin> g.V(1).values('name')
19 ==>marko
20
21 # The name and age of the persons who know the person with ID 1:
22 gremlin> g.V(1).outE('knows').inV().valueMap('name', 'age')
23 ==>[name:[vadas],age:[27]]
24 ==>[name:[josh],age:[32]]
25
26 # Which of these people who know the person with ID 1 is less than 30?
27 gremlin> g.V(1).out('knows').has('age', lt(30)).values('name')
28 ==>vadas
29
30 # Display matches step-by-step
31 gremlin> g.V().outE('created').inV().next()
32 ==>v[3]
33
34 gremlin> g.V().outE('created').inV().next(2)
35 ==>v[3]
36 ==>v[5]

The database contains information about a small group of software developers and their products. Some of the developers know each other and have contributed to the same projects. Additionally, Gremlin needs a traversal source, which describes how it should move through the graph. The easiest way to do this is to use the standard path (Listing 2, lines 4 and 5).

To output all the nodes, you use g.V(), and to list all edges, you use g.E(). The queries are generally organized in steps separated by dots. As previously mentioned, g is the traversal source, and V() is an iterator that runs step by step through all nodes. Further steps can now output properties of the respective node (.values()) or check it for certain conditions (.has()).

Additionally, filter results can be aggregated. For example, what happens when the user asks the database how many nodes of different types it has (Listing 2, lines 7 and 8)? Further examples of simple queries are demonstrated in lines 10-28. In principle, each query walks along the tree, which can be observed several ways. You can display hits, step by step (e.g., the first node a created edge points to; lines 31 and 32), the next two (lines 34-36), and so on.

With .toList(), you can return all matches in one fell swoop, and .toSet() returns all matches, but without duplicates. Also, you can retrieve the path that was traversed to get to a search hit. The call in Listing 3 (first query) displays the software projects in which developer Josh was involved.

Listing 3: Simple Queries

gremlin> g.V().hasLabel('person').has('name','josh').outE('created').inV().values('name')
==>ripple
==>lop
gremlin> g.V().hasLabel('person').has('name','josh').outE('created').inV().values('name').path()
==>[v[4],e[10][4-created->5],v[5],ripple]
==>[v[4],e[11][4-created->3],v[3],lop]
gremlin> g.V().outE('created').has('weight', 1.0).inV().values('name')
==>ripple
gremlin> g.V().outE('created').has('weight', 1.0).outV().values('name')
==>josh

The path determination (second query) first finds node 4 with label person and name property josh; from there, it then traces outgoing edges 10 and 11, which are both tagged with the created property (i.e., they must refer to software). By doing so, it finds nodes 3 and 5, which stand for lop and ripple.

In the case of a software product that someone has developed alone, the weighting of the created edge needs to be set to a value of 1.0, as shown in the third query. The final query finds the programmer who created a project on their own by looking for created edges with a weighting of 1.

These queries have all been simple thus far. Traversals also can follow branches in the graph (branching traversals), be processed recursively (recursive traversals), follow paths (path traversals), or refer to subgraphs (subgraph traversals). The step-by-step structure, however, always remains the same, and the number of commands required for the steps does not become unmanageable. In this article, though, I'll stick with simple queries.

If you want to explore the Gremlin language on your own, the Gremlin Console comes with other sample datasets, some of them far more extensive, like the eponymous database of Grateful Dead songs that includes information about the vocalists and composers and the order in which they appeared live and on vinyl. This database has many hundreds of nodes and thousands of edges and is tricky to visualize clearly.

(No) Puzzles

Take the classic river-crossing riddle: A farmer arrives at a river with a wolf, a goat, and a cabbage. He can only carry one animal or the cabbage in his boat at a given time. Furthermore, he cannot leave the wolf and the goat, or the goat and the cabbage, alone on a bank, because the former would eat the latter. How does he get all his belongings across the river unharmed? Spoiler alert! If you want to work it out yourself, read no further.

First, the farmer takes the goat to the other side and goes back empty. Then he picks up the wolf, but takes the goat back with him. Next, he can take the cabbage across and leave it with the wolf, and finally he can take the goat across the river.

This procedure can be represented as a graph with nodes as the respective end states of a crossing (Figure 5). The letters FGCW, for example, stand for farmer, goat, cabbage, and wolf. The comma symbolizes the river. Unauthorized states are colored red. Each step has an action the farmer could take to undo the transport that has just taken place, but this action would not make any progress and is not shown here for the sake of clarity.

Figure 5: The river-crossing puzzle as a sequence diagram. The boxes show the states at the end of a crossing.

Before saving the graph, you can simplify it further by omitting the reverse actions and the transitions to the illegal states. Whatever remains can be stored directly in the database without the need to design a schema or define tables first, unlike relational models. You could now enter all nodes and all edges interactively one after the other. However, it is easier and less prone to error to create the database content in a file (Listing 4) in XML format (GraphML) and then load this file into the Gremlin console with the commands in Listing 5.

Listing 4: River XML Database

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.1/graphml.xsd">
  <key id="labelV" for="node" attr.name="labelV" attr.type="string" />
  <key id="name" for="node" attr.name="name" attr.type="string" />
  <key id="labelE" for="edge" attr.name="labelE" attr.type="string" />
  <graph id="G" edgedefault="directed">
    <node id="1">
      <data key="labelV">state</data>
      <data key="name">FGCW,</data>
    </node>
    <node id="2">
      <data key="labelV">state</data>
      <data key="name">CW,FG</data>
    </node>
    <node id="3">
      <data key="labelV">state</data>
      <data key="name">FCW,G</data>
    </node>
    <node id="4">
      <data key="labelV">state</data>
      <data key="name">C,GFW</data>
    </node>
    <node id="5">
      <data key="labelV">state</data>
      <data key="name">FGC,W</data>
    </node>
    <node id="6">
      <data key="labelV">state</data>
      <data key="name">G,FWC</data>
    </node>
    <node id="7">
      <data key="labelV">state</data>
      <data key="name">FG,WC</data>
    </node>
    <node id="8">
      <data key="labelV">state</data>
      <data key="name">,FGCW</data>
     </node>
    <node id="9">
      <data key="labelV">state</data>
      <data key="name">W,GFC</data>
    </node>
    <node id="10">
      <data key="labelV">state</data>
      <data key="name">WFG,C</data>
    </node>
    <edge id="11" source="1" target="2">
      <data key="labelE">action</data>
      <data key="name">with_goat</data>
    </edge>
    <edge id="12" source="2" target="3">
      <data key="labelE">action</data>
      <data key="name">empty_return</data>
    </edge>
    <edge id="13" source="3" target="4">
      <data key="labelE">action</data>
      <data key="name">with_wolf</data>
    </edge>
     <edge id="14" source="4" target="5">
       <data key="labelE">action</data>
      <data key="name">with_goat</data>
    </edge>
    <edge id="15" source="5" target="6">
      <data key="labelE">action</data>
      <data key="name">with_cabbage</data>
    </edge>
    <edge id="16" source="6" target="7">
      <data key="labelE">action</data>
      <data key="name">empty_return</data>
    </edge>
    <edge id="17" source="7" target="8">
      <data key="labelE">action</data>
      <data key="name">with_goat</data>
    </edge>
    <edge id="18" source="3" target="9">
      <data key="labelE">action</data>
      <data key="name">with_cabbage</data>
    </edge>
    <edge id="19" source="9" target="10">
      <data key="labelE">action</data>
      <data key="name">with_goat</data>
    </edge>
    <edge id="20" source="10" target="6">
      <data key="labelE">action</data>
      <data key="name">with_wolf</data>
    </edge>
  </graph>
</graphml>

Listing 5: Loading the Database

gremlin> conf = new BaseConfiguration()
==>org.apache.commons.configuration.BaseConfiguration@1bbae752
gremlin> graph = TinkerGraph.open(conf)
==>tinkergraph[vertices:0 edges:0]
gremlin> g=graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.io("/home/jcb/river.xml").with(IO.reader,IO.graphml).read().iterate()

Now you can start a query at the node that contains the start position (FGCW,). From then on, you need to trace all outgoing edges and incoming nodes until the query finds the node that represents the final state after the river crossing (,FGCW). At this point, you want the database to output the path taken (.path(); the example has two equivalent paths). Listing 6 discovers the solution to the riddle. The output from the database lists the ID of the node (e.g., [1]), followed by the ID of the outgoing edge (e.g., [e11]) and a description of the edge with start and destination node and transported goods (e.g., 1-with_goat->2). After reaching the node, it continues in the same way until the query has arrived at the destination node.

Listing 6: Solving the Puzzle

gremlin> g.V().has('name','FGCW,').repeat(outE().inV()).until(has('name',',FGCW')).path()
==>[v[1],e[11][1-with_goat->2],v[2],e[12][2-empty_return->3],v[3],e[18][3-with_cabbage->9],v[9],e[19][9-with_goat->10],v[10],e[20][10-with_wolf->6],v[6],e[16][6-empty_return->7],v[7],e[17][7-with_goat->8],v[8]]
==>[v[1],e[11][1-with_goat->2],v[2],e[12][2-empty_return->3],v[3],e[13][3-with_wolf->4],v[4],e[14][4-with_goat->5],v[5],e[15][5-with_cabbage->6],v[6],e[16][6-empty_return->7],v[7],e[17][7-with_goat->8],v[8]]

The result is obvious in this case and could be read off from the graph at first glance, but if you were to store something more complex like the public transport network in Munich as a graph and search for a connection between two points (which might require changing trains one or more times), the task would no longer be trivial – especially if the edges are weighted by distance, travel time, or travel costs, and you are looking for the shortest, fastest, or cheapest route. The method also works even if the graph is too large to be visualized and surveyed.

Hands On

Beyond such finger exercises: For which applications are graph databases best suited in practical use? Are there any useful examples? In this article, I show you some impressive potential applications with genuine case studies.

Say Ms. X is an employee of company Y and lives in Z. She has an unsuspicious current account at a local savings bank. Mr. A lives at exactly the same address and is employed by B as a manager. B has subsidiary C, and it maintains an offshore account in the Cayman Islands. Ms. X's connection to this account is only indirect, but graph databases are particularly useful for tracing such connections around several corners. Indeed, the reporters of the ICIJ used this tool for months of analysis of the 2.6TB of leaked tax data from the Panama Papers – as mentioned at the beginning of the article. The well-known Neo4j open source graph database makes the data available to everyone free of charge for testing in its sandbox (Figure 6), accessible online from a browser [10].

Figure 6: The result of a search in the Panama Papers database.

Knowledge represented in the form of an ontology (i.e., a formally ordered hierarchy of terms) results in a tree-like graph. A semantic network, which also represents terms and their mutual relationships, is also a graph. Within the individual scientific disciplines, ancestry or kinship relationships in biology or history are a good example. The research project Nomen et Gens (NeG) [11], for example, stores in a graph database more than 21,500 individual records of names and persons in continental Europe between AD400 and 800, together with their relationships.

Another example from science is the cross-location, cross-discipline, and cross-species networking of research data on diabetes at the German Center for Diabetes Research (DZD) in the project Graphs to Fight Diabetes [12]. The database links many heterogeneous data items, including measured values, data from basic research, clinical studies or animal experiments, microscope images, and information on metabolic pathways, genes, or proteins, but also side effects and finally publication references. This combination results in a kind of meta-database that the researchers hope will help them identify new subtypes of the disease and find new, individualized steps for prevention or therapy.

The database currently comprises about 2.1 billion nodes and 3.8 billion edges and weighs in at about 480GB. It is currently used by around 450 scientists at the DZD but is due to be expanded to other centers of health research this year.

What is helpful here is that the relationships between the contents can be visualized (Figure 7). In this way, queries can also be put together by pointing and clicking, without necessarily having to learn one of the supported query languages. In the simplest case, you just need to select two nodes and choose Find Shortest Path from the context menu by right-clicking. Of course, you can also program queries in Cypher or a script language like Python or R.

Figure 7: Authors, institutions, publications, and how they are connected, as shown by this small excerpt from the graph database of the DZD. © Deutsches Zentrum f¸r Diabetesforschung

Yet another example of the efficacious use of a graph database is the Smart Infrastructure project at Siemens (see the "Intelligent Twins" box).

Intelligent Twins

By Markus Winterholer

At first glance, it seems strange: How can a house be intelligent? What is meant, however, is not so much its walls as the built-in air conditioning, heating, or lighting, which are linked by a network of diverse devices, sensors, service protocols, or contracts, are constantly monitored and analyzed and must be managed in a cost- and resource-saving manner.

In light of occupancy statistics, would it be advantageous to make two small rooms out of a large meeting room? In which rooms of more than 20 square meters in buildings without air conditioning was the average temperature in July 2019 greater than 30C (86F)? We're investigating such questions at Siemens under the heading of Smart Infrastructure. Our goal is to turn more than half a million commercially used buildings into smart buildings with the help of cloud-based digital services and products.

A digital twin serves as a collection point for all plans, contracts, models, service reports, and sensor data. This data is also available for simulations or is used for remote maintenance or energy optimization or flow into apps for room booking and navigation in the building. Combining this flood of data in a single model requires a highly flexible data structure.

Shortly after the start of the project, we realized at Siemens that a conventional relational database would not be fit for the purpose because of the variety of relationships between the data. Various NoSQL databases also reached their limits, because they structure the data poorly. The solution finally came with the introduction of a graph database, which made it possible to reduce the time-to-data by 30 to 50 percent and, in turn, allowed the core modeling team to refocus on refining the model rather than brooding over how to formulate extremely complex queries.

In our case, we chose Amazon Neptune because Amazon Web Services (AWS) gives us access to the widest range of services and lets us scale on demand. Additionally, with AWS, we can be confident that other services that we need will be continually developed. We have also reduced our operating costs.

What lessons have we learned? First, not all data is equally suitable for a graph database. It may make sense to leave time series data in a special database, for example. Second, the graph database alone does not solve the problem if the data model and queries are not adapted. These issues and formulating the queries require skilled personnel. Finally, the highest benefit can often only be achieved by suitable visualization of the query results.

MISSING IMAGE

Markus Winterholer, Global Head of Digital Buildings Go-To-Market Strategy, Siemens

Conclusions

Where the task in hand is primarily a matter of processing statements about relations between people, things, events, points in time, or places, graph databases are superior to their relational siblings for several reasons. Queries for relationship chains in particular, such as my friends' friends, need recursive joins in the relational world, which are very complex and slow and become all the more evident the deeper the concatenation goes. Graph databases are orders of magnitude more powerful for such problems. NoSQL databases like key-value stores or document-oriented databases have similar disadvantages. These databases often shift the data processing onus to the application when relationships are queried.

Another advantage of graph over relational databases is that they can be extended by new relations at any time and do not require a pre-defined schema. This circumstance also makes them more agile because they adapt better to a step-by-step expansion of the program logic. Furthermore, relational schemas, such as those that can be designed in an entity relationship model, only support simple and unidirectional relationships, whereas in a graph database, they have specific properties and can be bidirectional.

The opportunities to capitalize on these advantages are constantly growing as the world becomes increasingly interconnected. The value of data lies not only in sheer volume, but in the connections, and graph databases are the means of choice for representing, storing, and querying these connections .