Feature Neo4j Fraud Detector

Graph database Neo4j discovers fake reviews on Amazon

Digital Detective

A Neo4j graph database example shows how to uncover fraudulent reviews on Amazon. By Mike Schilli

Graph databases do not use the relational tables and join commands of traditional relational databases. Instead, they look for relations between nodes and support queries that would be slow or even impossible to process in their relational counterparts. In this article, I take advantage of the graph database structure to create an algorithm that detects fake product reviews on Amazon with a Neo4j instance in a Docker container.

Fraudulent Reviews

On closer inspection of a product on Amazon that has consistently earned five-star ratings, it often turns out that many of the reviewers are professional lackeys. The text often betrays that the author did not even use the product ("Great product, fast delivery!"). If you then search for further reviews from the same customer, you will often find other five-star reviews that look very similar. The problem is so evident on Amazon that customers rub their eyes in amazement wondering why the online giant doesn't intervene.

Graph databases can help identify such shenanigans. Several criteria can help detect patterns in the typical behavior of fraudsters and expose them. Does a single customer write hundreds of five-star ratings? Suspicious. Does a product have many of these boilerplate reviews? There could be something wrong with that. Do the members of a gang of fraudsters all review the same products?

If the alarm bells go off for only one of these criteria, you might not necessarily suspect misuse, but two or more increase the likelihood of fraud. Further investigation would then be worthwhile to see whether the intent is to rip off customers.

Detection Algorithm

The last of the previously mentioned criteria seems interesting from a programming point of view. How does an algorithm find groups of users who all rate the same products without having any clues as to which users they are?

Listing 1 shows a fictitious YAML list of products with the names of evaluators. A similar list could be obtained with real data from the Amazon website with the official API or a scraper.

Listing 1: reviews.yaml

reviews:
  product1:
    - reviewer1
    - reviewer2
    - reviewer3
    - reviewer7
  product2:
    - reviewer1
    - reviewer2
    - reviewer4
    - reviewer8
  product3:
    - reviewer3
  product4:
    - reviewer4
    - reviewer7
  product5:
    - reviewer5
    - reviewer8
  product6:
    - reviewer6

The human eye immediately recognizes that a dubious duo consisting of reviewer1 and reviewer2 reviewed both product1 and product2. If the data were only available in a relational data model, discovering this connection in a very large database within a reasonable amount of time would be very hard.
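For a toy data set like the one in Listing 1, a few lines of plain Go are enough to confirm what the eye sees. The following sketch is not the graph database approach discussed in this article; it is a brute-force pair count that assumes the data fits in memory, and the names reviews and sharedProducts are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// reviews mirrors the fictitious data from Listing 1: product -> reviewers.
var reviews = map[string][]string{
	"product1": {"reviewer1", "reviewer2", "reviewer3", "reviewer7"},
	"product2": {"reviewer1", "reviewer2", "reviewer4", "reviewer8"},
	"product3": {"reviewer3"},
	"product4": {"reviewer4", "reviewer7"},
	"product5": {"reviewer5", "reviewer8"},
	"product6": {"reviewer6"},
}

// sharedProducts counts, for every pair of reviewers, how many products
// both of them reviewed. Keys have the form "a|b" with a sorted before b.
func sharedProducts(reviews map[string][]string) map[string]int {
	counts := map[string]int{}
	for _, reviewers := range reviews {
		// Sort a copy so that each pair always produces the same key.
		sorted := append([]string(nil), reviewers...)
		sort.Strings(sorted)
		for i := 0; i < len(sorted); i++ {
			for j := i + 1; j < len(sorted); j++ {
				counts[sorted[i]+"|"+sorted[j]]++
			}
		}
	}
	return counts
}

func main() {
	// Report only pairs that reviewed more than one product together.
	for pair, n := range sharedProducts(reviews) {
		if n > 1 {
			fmt.Printf("%s share %d products\n", pair, n)
		}
	}
}
```

With the data from Listing 1, only the pair reviewer1|reviewer2 exceeds a count of one. The nested loops grow quadratically with the number of reviewers per product, which is exactly the kind of cost that becomes painful on real-world data.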

With graph databases that simply traverse along the relations between nodes instead of juggling relational tables and computationally expensive join commands, it is relatively easy to program smart algorithms. I discovered graph databases six years ago and featured them in an article [1]; however, the development of the genre has not stood still, which calls for a new look.

Prettified

The Go program presented in this issue converts the YAML list from Listing 1 into a graph that shows which products were evaluated by which persons.

To do this, it sends commands to a locally installed Neo4j database, which, when the program has run, displays the graph shown in Figure 1 with the relations between products and reviewers. The screenshot is taken from the window of a web browser, which uses http://localhost:7474 to point to a Neo4j installation that conveniently provides not only the server in a container, but also a web interface for graphically enhancing the data.

Figure 1: The Neo4j relation graph is accessible in the browser at http://localhost:7474.

Tracking Down Suspects

Once the data has been bundled onto the Neo4j server, users can type interactive commands in the Cypher shell to make queries and start analyses. Figure 2 shows a call to the Similarity algorithm [2] from a Neo4j plugin of scientific tools.

Figure 2: The Similarity algorithm has tagged reviewers 1 and 2 as suspicious.

The algorithm finds nodes in the graph that are connected by their relations to as many common neighbors as possible and then evaluates these as similar. It calculates the numerical degree of similarity from the Jaccard index [3] of the candidates.
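The Jaccard index of two sets is the size of their intersection divided by the size of their union. As a minimal sketch of that formula in Go (not the plugin's actual implementation; the function name jaccard is my own), consider the two reviewers from Listing 1 who both rated product1 and product2:

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two lists of product names.
// Duplicate entries within a list are counted only once.
func jaccard(a, b []string) float64 {
	setA := map[string]bool{}
	for _, x := range a {
		setA[x] = true
	}
	setB := map[string]bool{}
	for _, x := range b {
		setB[x] = true
	}

	inter := 0
	for x := range setA {
		if setB[x] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0 // two empty sets: define the similarity as 0
	}
	return float64(inter) / float64(union)
}

func main() {
	reviewer1 := []string{"product1", "product2"}
	reviewer2 := []string{"product1", "product2"}
	fmt.Printf("%.1f\n", jaccard(reviewer1, reviewer2)) // prints 1.0
}
```

Two reviewers with identical product sets reach the maximum value of 1.0; two reviewers with no product in common end up at 0.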

Figure 2 shows the result: Obviously the algorithm has determined that reviewers 1 and 2 have jointly evaluated products 1 and 2 and therefore assigns a numerical similarity value of 1.0 to the two rascals. Of course, this is not yet hard evidence of unfair practices, but the result at least shows where you could drill down further to reveal more evidence in a suspicious case.

What is interesting in the result is that other reviewers also evaluated several products, but not the same products in partnership, and were therefore given a lower similarity value. For example, reviewer 8 rated products 2 and 5, and reviewer 4 rated products 2 and 4, both receiving only 0.5 on the similarity scale because their behavior was less suspicious.

In the Thick of It

The best way to install a Neo4j instance on your home computer is to use a Docker container, which the command docker run retrieves from the network to launch a Neo4j server (Figure 3). Then, you can jump into the container by typing docker exec and open the interactive Neo4j Cypher shell to send commands to the server.

Figure 3: Docker commands retrieve Neo4j from the network, launch the server in a container, and open the interactive Cypher shell.

To allow browsers and API scripts to access the containerized Neo4j server from outside, the call in Figure 3 exports ports 7474 and 7687 from the container to the host machine, where the user can then access the Neo4j web server in a browser over http://localhost:7474.

After feeding the data into Neo4j, the browser view in Figure 1 pointing to http://localhost:7474 shows the resulting relationship model. On port 7687, the server in the container listens for commands sent over Bolt, the binary protocol officially used by Neo4j; scripts can use this port to query the database and feed in new data.

The call to Docker connects the data/, logs/, import/, and plugins/ directories on the host to the container, which allows the host and the container to exchange database files and logs; the user can download new plugins from the network into plugins/ on the host, where the container then picks them up.

Automatic Feed

Once the server is running in the container, the Go program can form a series of Neo4j commands from the YAML list of review data to feed the relationships into the database. To do this, first create nodes of the Reviewer and Product types and then insert a relation reviewed between the two (Listing 2); you could also enter these commands manually in the Cypher shell.

Listing 2: neo4j-commands.txt

01 MERGE (product1:Product {name:'product1'})
02 MERGE (reviewer1:Reviewer {name:'reviewer1'})
03 MERGE (reviewer1)-[:Reviewed {name: 'reviewed'}]-(product1)
04 MERGE (reviewer2:Reviewer {name:'reviewer2'})
05 MERGE (reviewer2)-[:Reviewed {name: 'reviewed'}]-(product1)
06 MERGE (reviewer3:Reviewer {name:'reviewer3'})
07 MERGE (reviewer3)-[:Reviewed {name: 'reviewed'}]-(product1)
08 MERGE (reviewer7:Reviewer {name:'reviewer7'})
09 MERGE (reviewer7)-[:Reviewed {name: 'reviewed'}]-(product1)
10 [...]

The MERGE command creates a new entry, either a node or a relation, which could just as easily be done with a CREATE command; however, MERGE will not create a duplicate if the entry already exists. Line 1 creates a new node of type Product, assigns it the name attribute product1, and stores a reference to it in the product1 variable. The same happens with a Reviewer node in line 2; line 3 then links the previously defined reviewer1 and product1 variables with a relation of type Reviewed, which sets the name attribute to reviewed.

Entering all the data manually would quickly get on a user's nerves, which is why the Go program in Listing 3 automates the task of generating a series of Neo4j commands from the YAML list and sends them over port 7474 to the Neo4j server running in the container.

Listing 3: rimport.go

01 package main
02
03 import (
04   "database/sql"
05   "fmt"
06   _ "gopkg.in/cq.v1"
07   "gopkg.in/yaml.v2"
08   "io/ioutil"
09   "log"
10 )
11
12 type Config struct {
13   Reviews map[string][]string
14 }
15
16 func main() {
17   yamlFile := "reviews.yaml"
18   data, err := ioutil.ReadFile(yamlFile)
19   if err != nil {
20     log.Fatal(err)
21   }
22
23   var config Config
24   err = yaml.Unmarshal(data, &config)
25   if err != nil {
26     log.Fatal(err)
27   }
28
29   created := map[string]bool{}
30   cmd := ""
31   // nuke all content
32   toNeo4j(`MATCH (n) OPTIONAL MATCH
33            (n)-[r]-() DELETE n,r;`)
34
35   for prod, reviewers :=
36       range config.Reviews {
37     for _, rev := range reviewers {
38       if _, ok := created[prod]; !ok {
39         cmd += fmt.Sprintf(
40         "MERGE (%s:Product {name:'%s'})\n",
41           prod, prod)
42         created[prod] = true
43       }
44       if _, ok := created[rev]; !ok {
45         cmd += fmt.Sprintf(
46         "MERGE (%s:Reviewer {name:'%s'})\n",
47           rev, rev)
48         created[rev] = true
49       }
50       cmd += fmt.Sprintf(
51         "MERGE (%s)-[:Reviewed " +
52         "{name: 'reviewed'}]-(%s)\n",
53          rev, prod)
54     }
55   }
56   cmd += ";"
57   toNeo4j(cmd)
58 }
59
60 func toNeo4j(cmd string) {
61   db, err := sql.Open("neo4j-cypher",
62     "http://neo4j:test@localhost:7474")
63   if err != nil {
64     log.Fatal(err)
65   }
66   defer db.Close()
67
68   _, err = db.Exec(cmd)
69
70   if err != nil {
71     log.Fatal(err)
72   }
73 }

YAML in the Go Universe

Listing 3 reads the YAML file into a byte array with io/ioutil (line 18) and then uses Unmarshal() (line 24) from the widely used yaml.v2 module in the Go universe to convert the data into a Go data structure.

Go's strict typing accommodates the fairly casual YAML format here with little effort: The data maps onto a string-indexed hash table whose entries are arrays of strings. The Config type structure starting in line 12 defines this hash map with the nested string arrays in the Reviews field. Capitalization is important here, because the YAML module can only access exported (i.e., capitalized) fields.
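The capitalization rule is easy to try out without installing anything. The sketch below uses encoding/json from the standard library as a stand-in, on the assumption that it treats unexported fields the same way the YAML module does: It silently skips them. The type names and the decodeBoth helper are invented for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// exportedConfig has a capitalized field that the decoder can fill.
type exportedConfig struct {
	Reviews map[string][]string
}

// hiddenConfig has a lowercase field that the decoder cannot see.
type hiddenConfig struct {
	reviews map[string][]string
}

// decodeBoth decodes the same input into both structs and returns the
// number of products that ended up in each of them.
func decodeBoth(data []byte) (exported, hidden int, err error) {
	var good exportedConfig
	if err = json.Unmarshal(data, &good); err != nil {
		return 0, 0, err
	}
	var bad hiddenConfig
	if err = json.Unmarshal(data, &bad); err != nil {
		return 0, 0, err
	}
	return len(good.Reviews), len(bad.reviews), nil
}

func main() {
	data := []byte(`{"reviews": {"product3": ["reviewer3"]}}`)
	exported, hidden, err := decodeBoth(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(exported, hidden) // prints "1 0"
}
```

The decoder reports no error for the lowercase variant; the map simply stays empty, which makes this kind of typo tedious to track down.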

Starting in line 35, two for loops iterate over all products in the hash map and then over the array of reviewers for each entry. Before line 50 compiles the command for adding the relation, the if conditions in lines 38 and 44 check whether the two endpoints of the relation already exist as nodes in the database.

If the created map variable indicates that a node is still missing, the code adds to the cmd string a command that creates the node with a MERGE instruction. It terminates each command with a line break rather than a semicolon. This matters because a semicolon ends a statement, and Neo4j then forgets all variables defined previously; commands that define variables (e.g., reviewer1) and reuse them later (when the relation is created) would fail. Only line 56 appends a single semicolon at the very end.
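To check the generated commands without a running server, the loop can be pulled out into a function that only builds the string. The following is a sketch along the lines of Listing 3, not a replacement for it; the function name buildCommands is my own, and the products are sorted first, because Go iterates over maps in random order and the output would otherwise differ from run to run.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildCommands turns a product -> reviewers map into one multi-line
// Cypher statement that ends in a single semicolon.
func buildCommands(reviews map[string][]string) string {
	products := make([]string, 0, len(reviews))
	for prod := range reviews {
		products = append(products, prod)
	}
	sort.Strings(products) // reproducible order

	created := map[string]bool{}
	var b strings.Builder
	for _, prod := range products {
		for _, rev := range reviews[prod] {
			if !created[prod] {
				fmt.Fprintf(&b, "MERGE (%s:Product {name:'%s'})\n", prod, prod)
				created[prod] = true
			}
			if !created[rev] {
				fmt.Fprintf(&b, "MERGE (%s:Reviewer {name:'%s'})\n", rev, rev)
				created[rev] = true
			}
			fmt.Fprintf(&b, "MERGE (%s)-[:Reviewed {name: 'reviewed'}]-(%s)\n", rev, prod)
		}
	}
	b.WriteString(";") // one terminator for the whole statement
	return b.String()
}

func main() {
	fmt.Println(buildCommands(map[string][]string{
		"product3": {"reviewer3"},
	}))
}
```

For the single product3 entry, the function emits three MERGE lines followed by the closing semicolon, all of which travel to the server as one statement so that the variables remain valid.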

Contacting the Server

The toNeo4j() function starting in line 60 contacts the browser port of the server in the container. The main program uses it to transmit the command string cmd, which it has assembled from the map data, and precedes these instructions with a command (line 32) to first delete all previously existing data.

The open source package used here, cq from GitHub, is a bit outdated. Although it does not use the Bolt connection that Neo4j supports on port 7687, it works fine. It is also easier to install than the official driver, which forces you to download some obscure Bolt binaries.

In typical SQL style, line 61 contacts the server in the Docker container. Line 68 uses Exec() to send the command present in cmd over the port, which the server acknowledges with an error message if something went wrong.
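If you are curious what a request on port 7474 might look like without a driver, the following sketch talks to the transactional endpoint of the Neo4j 3.5 HTTP API (/db/data/transaction/commit) with nothing but the standard library. Whether cq uses exactly this path internally is an assumption on my part, and the names buildPayload and send are hypothetical helpers, not part of any Neo4j package.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// statement and payload model the JSON body of the HTTP endpoint.
type statement struct {
	Statement string `json:"statement"`
}

type payload struct {
	Statements []statement `json:"statements"`
}

// buildPayload wraps a single Cypher command in the JSON envelope.
func buildPayload(cmd string) ([]byte, error) {
	return json.Marshal(payload{Statements: []statement{{Statement: cmd}}})
}

// send posts the command to the server, e.g. at http://localhost:7474,
// and returns an error if the server rejects it.
func send(baseURL, user, pass, cmd string) error {
	body, err := buildPayload(cmd)
	if err != nil {
		return err
	}
	req, err := http.NewRequest("POST",
		baseURL+"/db/data/transaction/commit", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.SetBasicAuth(user, pass)
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Accept", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}

	// Cypher errors arrive in the response body, not as an HTTP status.
	var result struct {
		Errors []struct {
			Code    string `json:"code"`
			Message string `json:"message"`
		} `json:"errors"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return err
	}
	if len(result.Errors) > 0 {
		return fmt.Errorf("%s: %s", result.Errors[0].Code, result.Errors[0].Message)
	}
	return nil
}

func main() {
	// Only print the payload here; send() needs a running server.
	body, err := buildPayload("MATCH (n) RETURN count(n)")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

With the container from Figure 3 running, a call such as send("http://localhost:7474", "neo4j", "test", cmd) would then play the role of toNeo4j() in Listing 3.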

With the command sequence

$ go mod init rimport
$ go build

Go fetches the libraries needed to create the binary from GitHub and creates an executable program named rimport. When called, the executable first reads the reviews.yaml file from disk and then pumps the necessary commands into the container port to the Neo4j server. The user can then send queries to the data model for fraud detection, as shown in Figure 2.

Installation Troubles

The current Docker image neo4j:latest drags in the latest Neo4j version 4.0.3, which does not yet support any graph algorithms. To install them, you have to download a .jar file from the Neo4j site [4] and dump it into the ~/neo4j/plugins/ directory. There, the Docker container will grab it when the Neo4j server is started, because the docker run command in Figure 3 imports the plugin directory with the -v option.

Hold on, not so fast: The graph algorithms plugin is only available as version 3.5.9. If you think you can simply use it with a Neo4j database of version 4.0.3, think again. Right after restart, you'll see the container quickly giving up the ghost with a long, but completely meaningless, stack trace. If you install neo4j:3.5.9 instead of neo4j:latest, you will have more luck. The server starts up properly, and the database query for algorithms in the algo.* namespace reveals a long list (Figure 4).

Figure 4: After installing the graph plugins, Neo4j shows the retroactively loaded algorithms.

Unfortunately, you will encounter more obstacles. When you try to use one of the algorithms, an error message on the screen explains that this is not possible in a "sandbox" for safety reasons. Instead, you will need to exempt the imported algorithms from the routinely imposed restrictions. To do this, the environment variable NEO4J_dbms_security_procedures_unrestricted is set to a regular expression to specify that everything below the namespace algo enjoys free rein.

The Docker command in Figure 3 already defines the variable correctly. It also sets the NEO4J_AUTH variable to neo4j/test, which tells the server to omit the otherwise mandatory password reset. Let the fun begin!