Feature Graph Database Advantages

Advantages of data analysis with graph databases

Relationship Status

Analyze scattered but related data in real time with a graph database.

By Zeljko Dodlek

The volume and variety of collected data are constantly growing, which prompts enterprises, for certain use cases, to turn to a new generation of database technology. The structure and query language of graph databases allow current data to be correlated and recognized in real time. Furthermore, graph databases offer massive speed advantages when it comes to evaluating highly connected datasets.

Limits of Relational Databases

Most enterprises, through websites and other methods, build huge databases. Analyzing these databases can reveal important information about customers, helping to predict user behavior, determine optimal pricing, and overcome operational challenges. For some time now, the hurdles for the use of graphs have been falling as a variety of functions from modern query languages have become available. On top of this, most cloud vendors now offer graph technology as a service in addition to the proven relational database management system (RDBMS) options.

RDBMSs were created in the 1970s for storing, processing, and retrieving structured data, such as records of financial transactions that are stored in a tabular format, and they are still widely used in organizations. However, data volume is now a genuine problem, and developers are increasingly looking for other approaches. NoSQL databases and key/value stores can solve part of this challenge, but they do not provide the analytical capabilities that organizations need to turn data into actionable insights.

Additionally, other fields of application centering on Big Data (e.g., fraud detection, supply chain management, risk analysis, or recommendation generation) also cause headaches because individual data structures have to be set in relation to each other. The graph database has been around for several decades as a concept, but recent innovations in storage and processing performance and the evolution of graph query languages toward Turing completeness [1] have meant that the graph is often the best (and sometimes only) approach to addressing these challenges.

An RDBMS in itself is helpful and good, but if you want to implement graph-style solutions (e.g., in SQL), you will encounter difficulties. SQL databases lack native support for edges, which means that additional work is required to create one-to-many and many-to-many connections. As a result, analyzing relationships across more than three or four connections becomes computationally expensive. The integration of new or changed data classes means creating new table structures and is therefore a cost factor that should not be underestimated.
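To illustrate the extra work, the following minimal sketch uses Python's built-in sqlite3 module. The person and knows tables and the sample names are hypothetical and serve only to show that a many-to-many relationship needs its own junction table and that every additional hop adds another self-join.

```python
import sqlite3

# Hypothetical schema: people plus a junction table standing in for edges.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE knows (
        src INTEGER NOT NULL REFERENCES person(id),
        dst INTEGER NOT NULL REFERENCES person(id)
    );
    INSERT INTO person (id, name) VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Dee');
    INSERT INTO knows (src, dst) VALUES (1, 2), (2, 3), (3, 4);
""")

# Who is exactly three hops away from person 1?
# Each hop requires one more self-join on the junction table.
three_hops = conn.execute("""
    SELECT p.name
    FROM knows AS k1
    JOIN knows AS k2 ON k2.src = k1.dst
    JOIN knows AS k3 ON k3.src = k2.dst
    JOIN person AS p ON p.id = k3.dst
    WHERE k1.src = ?
""", (1,)).fetchall()

print(three_hops)
```

A query over four or five hops would need four or five such joins, which is where the computational cost described above begins to add up.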

Moreover, creating data connections, correcting errors, and maintaining queries for data connections can often be extremely difficult. With a graph query language, a developer can write complex queries that are easier to build and debug because the relationships modeled by the data faithfully reflect real-world situations. Some of these languages are specifically designed for analysis, so it is not necessary to leave the query language repeatedly for compute tasks.

Although NoSQL databases are ideal for storing and retrieving unstructured data, they suffer from the same query language limitations as SQL. Because of their structure, relationship analyses of more than two or three hops also require multiple table scans, which means that the processing time increases rapidly the more complex the queries become.

Advantages of Graph Databases

Graph databases can do many things that RDBMSs cannot do, or have difficulty doing. Conceptually, graphs model data more naturally than do RDBMSs. In a graph database, objects can be set in relation to each other and linked accordingly, rather than being forced into a standardized table format.

Many of these challenges in modeling complex information networks are solved by graph databases that use database schemata and a graph query language. The basic structure of a graph database is node-edge-node, which makes it possible in practice to represent a relationship such as "object A is connected to object C by edge B." With a well-designed graph solution, developers can add properties to these three elements, creating an environment with context information for all data.
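The node-edge-node structure can be sketched in a few lines of Python. The Node and Edge classes and their property names are illustrative assumptions, not the data model of any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    key: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: Node
    label: str
    target: Node
    properties: dict = field(default_factory=dict)


# All three elements (both nodes and the edge) can carry context information.
a = Node("A", {"type": "object"})
c = Node("C", {"type": "object"})
b = Edge(a, "B", c, {"since": 2020})


def describe(edge: Edge) -> str:
    """Render an edge as the sentence used in the text."""
    return (f"object {edge.source.key} is connected to "
            f"object {edge.target.key} by edge {edge.label}")


print(describe(b))
```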

In this structure, the hop (i.e., the transition from one node to the next) is the basic unit for calculations. From this hop, a relationship is calculated that is then returned as a value. Acquiring and processing such hop values are the basic components of a graph query, and given a graph query language with Turing completeness, the calculations required for complex analyses can be grouped with the hop values.
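As a rough sketch of this idea, the following Python example treats one hop as a function that carries a value across every outgoing edge and groups a small calculation with it. The graph, the edge weights, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical weighted graph: node -> list of (neighbor, edge weight).
graph = {
    "A": [("B", 2.0), ("C", 1.0)],
    "B": [("D", 3.0)],
    "C": [("D", 4.0)],
    "D": [],
}


def hop(frontier: dict) -> dict:
    """Advance one hop, multiplying each carried value by the edge weight."""
    next_frontier = defaultdict(float)
    for node, value in frontier.items():
        for neighbor, weight in graph[node]:
            next_frontier[neighbor] += value * weight
    return dict(next_frontier)


def traverse(start: str, hops: int) -> dict:
    """Repeat the hop operation and return the values that arrive."""
    frontier = {start: 1.0}
    for _ in range(hops):
        frontier = hop(frontier)
    return frontier


print(traverse("A", 1))
print(traverse("A", 2))
```

After two hops, node D receives the sum over both paths (2 x 3 through B plus 1 x 4 through C), which shows how a calculation can be accumulated hop by hop instead of in a separate processing step.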

Because connection information is stored directly with the data when it is written, a graph database avoids the index searches for join operations that typically become performance bottlenecks in SQL. A graph query therefore does not need to compute these connections again at query time.

This property, referred to as index-free adjacency, is found only in native graph databases. It allows a traversal rate of several million nodes per second, which is why response times can be several orders of magnitude faster than for join-based queries in relational databases. A good example is computing the shortest path in a route calculation. This function is also increasingly being used in machine learning and artificial intelligence (AI) scenarios.
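The idea behind index-free adjacency and a shortest path query can be approximated in Python. In this sketch, every node lists its neighbors directly, and a breadth-first search follows those links. Note the assumptions: the route network is invented, and a Python dictionary is itself a hash lookup, so the example imitates the principle rather than the storage layout of a native graph database.

```python
from collections import deque

# Hypothetical route network: each node lists its neighbors directly.
adjacency = {
    "Depot": ["North", "East"],
    "North": ["Center"],
    "East": ["Center", "Harbor"],
    "Center": ["Harbor"],
    "Harbor": [],
}


def shortest_path(start, goal):
    """Breadth-first search: fewest hops between two nodes, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


print(shortest_path("Depot", "Harbor"))
```

Breadth-first search finds the path with the fewest hops. A route calculation with distances or travel times on the edges would call for a weighted algorithm such as Dijkstra's instead.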

Native graph databases with massively parallel processing capabilities that enable rapid data compression and decompression can deliver results in seconds that would take several hours to compute with traditional database technology.

Graph databases support a number of algorithm classes that are difficult or impractical to implement in an RDBMS:

Path algorithms, which find the shortest path between nodes and evaluate paths (e.g., shortest path, cycle detection, and minimum spanning tree).
Centrality algorithms, which rank nodes according to the degree of their connection or the central position of a node by edge weighting (PageRank and closeness centrality).
Community algorithms, which can determine how a group is clustered or divided (connected components, label propagation, triangle counting, and Louvain modularity).
Similarity algorithms, which determine how similar a node is to its neighbors (cosine similarity, Jaccard similarity).
Classification algorithms, which predict the classification of a given node according to previously classified nodes (k-nearest neighbor, cosine similarity).
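Two of the algorithms named in this list are compact enough to sketch in plain Python: Jaccard similarity (a similarity algorithm) and connected components (a community algorithm). The small undirected graph is a made-up example, and a production graph database would run such algorithms inside the engine rather than in application code.

```python
# Hypothetical undirected graph: node -> set of neighbors.
neighbors = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"E"},
    "E": {"D"},
}


def jaccard(u, v):
    """Shared neighbors divided by all neighbors of either node."""
    union = neighbors[u] | neighbors[v]
    if not union:
        return 0.0
    return len(neighbors[u] & neighbors[v]) / len(union)


def connected_components():
    """Group nodes that can reach each other over any number of hops."""
    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return components


print(jaccard("A", "B"))
print(connected_components())
```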

Such algorithms can be used to optimize existing applications and develop entirely new solutions for companies that analyze huge pools of data.

Like neural networks, and thus machine learning, graph databases are characterized by the interconnection of properties, which is why graphs are a valuable tool for AI algorithms.

Applications in Logistics

Graph databases allow organizations to analyze data in a format that better reflects the relationships between the objects underlying the data, allowing developers to use meaningful approaches that address issues such as page ranking, social media links, customer analysis, fraud detection, real-time product recommendations, and risk assessment.

For example, many supply chain management solutions facilitate work in specific areas, such as storage and transportation, but approaches that cover all aspects are rare. The datasets required for supply chain management are inherently large, stored in isolation, and distributed across different systems on both the material and production side.

Graphs allow developers to do justice to the essence of supply chain management. Their holistic data model is based on the actual relationships between the individual elements of a supply chain, such as "plant A buys component B from supplier C" or "component B is delivered by carrier D to plant A." If thousands of such relationships are linked together, a typical use case for a graph is created.
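The two example relationships can be written down as labeled edges in a short Python sketch. The edge labels, the property names, and the parties_for helper are hypothetical and only show how such statements translate into a graph-like structure that can be queried by relationship.

```python
from collections import defaultdict

# Hypothetical edges: (source, label, target, edge properties).
edges = [
    ("Plant A", "BUYS", "Component B", {"supplier": "Supplier C"}),
    ("Supplier C", "SUPPLIES", "Component B", {}),
    ("Carrier D", "DELIVERS", "Component B", {"destination": "Plant A"}),
]

# Store each edge with its target node so relationships can be followed directly.
incoming = defaultdict(list)
for source, label, target, props in edges:
    incoming[target].append((source, label, props))


def parties_for(component):
    """Return every party with a direct relationship to a component."""
    return sorted(source for source, _label, _props in incoming[component])


print(parties_for("Component B"))
```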

Most companies that use graph solutions for supply chain management use them in combination with their existing enterprise resource planning systems to gain real-time insights into planning, pricing, and resource allocation.

Use in AI

Graphs have always been closely related to the field of artificial intelligence, especially machine learning. In pattern recognition, for example, graph algorithms are used (Figure 1) for unsupervised machine learning, neural networks, and deep learning processes. Typical use cases include fraud detection, personalized recommendations, identification of target groups and influential users, and identification of weaknesses and bottlenecks in operations and in the supply chain.

Figure 1: Machine learning algorithms use comprehensive datasets to perform differentiated classification, forecasting, and processing tasks.

Machine learning is a challenging field, and graph-based models are no exception. With every hop (i.e., every level of networked data), the size of the data pool to be searched increases exponentially, and from a computational point of view, this approach is simply too expensive for other database architectures. For data connections in relational databases to be evaluated meaningfully, uneconomic table joins are necessary. Additionally, classic NoSQL databases can recreate graphs at the application level only, are too cost-intensive, and do not provide support for complex queries.
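A quick calculation makes the growth per hop concrete. The sketch below assumes, purely for illustration, that every node has the same number of connections and that no two paths lead to the same node, so the result is an upper bound rather than a measurement.

```python
def frontier_size(avg_degree: int, hops: int) -> int:
    """Upper bound on the nodes reached at a given hop (no overlapping paths)."""
    return avg_degree ** hops


# Assumed average of 10 connections per node, hops 1 through 4.
sizes = [frontier_size(10, k) for k in range(1, 5)]
print(sizes)
```

With 10 connections per node, the fourth hop already touches up to 10,000 nodes, and each further hop multiplies that figure by 10 again.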

Machine learning algorithms in graphs are used, among other things, to detect spam calls on mobile phones by analyzing the behavior of the source phone in relation to the target phone. However, simply determining whether the source phone is known to the target phone and how many times calls from the source phone have been rejected by other recipients yields only a rough probability that the call is spam.

Mobile phone provider China Mobile has developed a more sophisticated procedure for this: With the help of a graph database, the behavior of the 900 million mobile phones registered with China Mobile can be checked, which means that about 2 billion calls per week can be analyzed. An algorithm that analyzes 118 different characteristics for each investigated phone, combined with a native graph database featuring massively parallel processing capacities, enables calls to be examined in real time and allows the called person the option of rejecting an incoming call while the phone is still ringing.

Conclusions

Companies face the growing challenge of having to process vast amounts of information. This problem is not only about storing and retrieving data. Analysts also need to be able to perform in-depth analysis to obtain meaningful results. Graph databases are already helping in many areas, including supply chain management, customer analysis, and social media management. Together with complex algorithms based on graph query languages, this type of data storage enables the development of new applications. In much the same way that relational databases revolutionized data collection in the 1970s and 1980s, graphs will make it possible to break new ground in data processing over the next few decades.