Revealing malware relationships with GraphDB: Part 1

In this post, we will learn how a graph database like Neo4j can help visualize malware relationships and extend these relationships to identify patterns between samples. Before we dig into Neo4j, let’s start with some fundamental graph terminology:

Nodes represent entities such as a human, car, laptop or phone.

Properties are attributes nodes can contain. A steering wheel or tires would be a property of the “car” node.

Labels are a way to group together nodes of a similar type. For example, a label of “FastFood” may include nodes such as “Taco Bell, McDonald’s, and Chipotle”.

Edges (or relationships) represent the connection between two nodes. Relationships can also have their own properties.
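
To make these terms concrete, here is a minimal Python sketch of a tiny in-memory property graph. The Node and Edge classes and the sample values are purely illustrative assumptions; they are not part of Neo4j or its drivers.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph, grouped by a label and described by properties."""
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, typed relationship between two nodes."""
    start: Node
    rel_type: str
    end: Node
    properties: dict = field(default_factory=dict)


# Two nodes with different labels (illustrative values)
car = Node("Car", {"color": "red", "tires": 4})
owner = Node("Person", {"name": "Alice"})

# One edge connecting them; relationships can carry properties too
owns = Edge(owner, "OWNS", car, {"since": 2015})

print(f"({owns.start.label})-[:{owns.rel_type}]->({owns.end.label})")
# prints: (Person)-[:OWNS]->(Car)
```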

Getting started with Neo4j


Neo4j is a graph database known for its simplicity and easy-to-use interface. I find the structure of a graph database quite fascinating, along with the challenge of normalizing each sample’s malware analysis data into a schema that works for a graph database. To get started, we first need to get a Neo4j instance running. The quickest way to do this is with Docker. Once you have Docker installed, you can quickly pull down a Neo4j Docker image using the following command:

docker pull neo4j

Once you have the image downloaded to your system, you can start the container by running the command below:

docker run \
    --publish=7474:7474 --publish=7687:7687 \
    --volume=$HOME/neo4j/data:/data \
    neo4j

If all goes well, you should see some standard output in your console, including the line:

INFO Remote interface available at http://localhost:7474/

If you navigate to this URL in your browser, you should be prompted to log in to the Neo4j Docker container using the default credentials “neo4j/neo4j”. After logging in and changing your password, you can begin exploring the interface. If you’re new to Neo4j, I would recommend digging into the “Learning about Neo4j” section so you can get a handle on the syntax for searching and updating nodes or edges in the database.

Define the schema

To load the data into Neo4j, we need to build a schema that defines our nodes, edges and their properties. As with most databases, defining a standardized schema before inserting data is very important. Let’s start by taking a look at what a simple File node looks like below:

MERGE (n:File { md5 : $md5, sha1 : $sha1, sha256 : $sha256, size : $size }) RETURN n

You can see from the statement above that we have a label of File, which has the properties md5, sha1, sha256 and size. A label is a way to group nodes of a similar type together. Notice we don’t have name or path properties inside the File node. This is because any file can be renamed or moved to a different location, but its hashes and size will remain the same. Instead, we create a separate node for each filename and path, as other malware samples may reuse the same name or file path, thus creating a relationship between two different pieces of malware. However, a relationship based solely on a filename or path on disk is weak unless the name or path is highly distinctive. I’ve outlined the other nodes, labels and their properties below in Neo4j’s Cypher syntax.

MERGE (n:FileType { type : $file_type }) RETURN n

MERGE (n:Compiled { timestamp : $timestamp }) RETURN n

MERGE (n:Library { name : $library }) RETURN n

MERGE (n:Function { name : $function }) RETURN n

MERGE (n:Detection { name : $detection }) RETURN n
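
Since every node statement follows the same shape, the schema can be kept in one place and the MERGE text generated from it. The sketch below is one possible way to do that in Python, not the loader used for this post. NODE_SCHEMA and merge_node_query are hypothetical names, and the sketch assumes each parameter is named after its property and written in the $name form that Neo4j 4.0 and later require.

```python
# Labels and their properties, taken from the node statements above.
NODE_SCHEMA = {
    "File": ["md5", "sha1", "sha256", "size"],
    "FileType": ["type"],
    "Compiled": ["timestamp"],
    "Library": ["name"],
    "Function": ["name"],
    "Detection": ["name"],
}


def merge_node_query(label):
    """Build a parameterized MERGE statement for one node label.

    Assumption: each parameter shares its property's name ($md5, $name, ...).
    """
    props = ", ".join(f"{prop}: ${prop}" for prop in NODE_SCHEMA[label])
    return f"MERGE (n:{label} {{ {props} }}) RETURN n"


print(merge_node_query("File"))
# prints: MERGE (n:File { md5: $md5, sha1: $sha1, sha256: $sha256, size: $size }) RETURN n
```

Keeping values in parameters rather than formatting them into the query string lets the database handle quoting and reuse the query plan.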

In addition to creating the various nodes, edges and properties, we also want to define the relationships these nodes can have with each other and the direction of those relationships. Let’s check out an example of creating a relationship between the File and FileType nodes using the Cypher query below:

MATCH (n:File { md5 : $md5, sha1 : $sha1, sha256 : $sha256, size : $size }), (f:FileType { type : $file_type }) MERGE (n)-[:HAS_FILETYPE]->(f)

In this query, we first match on the File node (assigned to the variable n), then match on the FileType node (assigned to f). Once both matches are found, we establish a HAS_FILETYPE relationship between these two nodes. In Neo4j, each relationship has exactly one direction; a single relationship cannot point both ways. A relationship also cannot point to another relationship. To work around this, we can use intermediary nodes to help link nodes together in more complex relationships (intermediary nodes will be covered in a future post). To better show how this will look, let’s view the final schema below in Neo4j:
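
Returning to the MATCH-then-MERGE pattern described above, the relationship statements can be generated the same way as the node statements. This is a rough sketch under a few assumptions: merge_relationship_query is a hypothetical helper, the File node is matched on sha256 alone (assuming that hash is unique per file), and parameters are prefixed with start_ and end_ so that two nodes sharing a property name do not collide. HAS_FILETYPE is the only relationship type taken from this post.

```python
def merge_relationship_query(start_label, start_keys, rel_type, end_label, end_keys):
    """Build a statement that matches two existing nodes and links them.

    The relationship is directed from the start node to the end node.
    """
    start = ", ".join(f"{key}: $start_{key}" for key in start_keys)
    end = ", ".join(f"{key}: $end_{key}" for key in end_keys)
    return (
        f"MATCH (a:{start_label} {{ {start} }}), (b:{end_label} {{ {end} }}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


query = merge_relationship_query(
    "File", ["sha256"], "HAS_FILETYPE", "FileType", ["type"]
)
print(query)
```

Because MERGE only creates the relationship when it does not already exist, running the same statement twice for the same sample should not produce duplicate edges.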

Testing a sample set

For this post, I’m using WannaCry ransomware samples to better understand the relationships between these binaries at a static analysis level (not running the malware). To get started, we need to get the malware metadata into Neo4j. To extract the malware’s static attributes, I ran PEFRAME against each sample and saved all the JSON outputs to a single directory. We can then use a little bit of Python to load the JSON data for each sample and build Cypher queries that create our nodes and relationships. Let’s take a look at a single node and all its relationships below:

We can see from the image above that our single WannaCry specimen imports six libraries, may use at least 83 functions from these six libraries and has four detections. We also see the executable’s FileType in green, which in this example is “PE32 executable (GUI) Intel 80386, for MS Windows”, and its compile timestamp of “2017-05-04 22:34:46” in the gray node.
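
The loading step mentioned earlier (read each sample’s JSON, then generate Cypher) could look roughly like the sketch below. It is not the script used for this post. The report keys (md5, sha1, sha256, size, file_type, libraries, detections) are hypothetical and would need to be mapped to whatever layout your PEFRAME version emits, and the hashes in the sample record are placeholders.

```python
import json
from pathlib import Path


def load_reports(directory):
    """Yield one parsed JSON document per *.json file in the directory."""
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def report_to_statements(report):
    """Turn one report into a list of (cypher, parameters) tuples.

    The keys read from `report` are assumptions, not PEFRAME's real layout.
    """
    statements = [
        (
            "MERGE (n:File { md5: $md5, sha1: $sha1, sha256: $sha256, size: $size })",
            {key: report[key] for key in ("md5", "sha1", "sha256", "size")},
        ),
        ("MERGE (f:FileType { type: $type })", {"type": report["file_type"]}),
        (
            "MATCH (n:File { sha256: $sha256 }), (f:FileType { type: $type }) "
            "MERGE (n)-[:HAS_FILETYPE]->(f)",
            {"sha256": report["sha256"], "type": report["file_type"]},
        ),
    ]
    for library in report.get("libraries", []):
        statements.append(("MERGE (l:Library { name: $name })", {"name": library}))
    for detection in report.get("detections", []):
        statements.append(("MERGE (d:Detection { name: $name })", {"name": detection}))
    return statements


# Placeholder record standing in for one parsed report
sample = {
    "md5": "a" * 32,
    "sha1": "b" * 40,
    "sha256": "c" * 64,
    "size": 1024,
    "file_type": "PE32 executable (GUI) Intel 80386, for MS Windows",
    "libraries": ["KERNEL32.dll", "ADVAPI32.dll"],
    "detections": ["packer"],
}

for cypher, params in report_to_statements(sample):
    print(cypher, sorted(params))
```

Each (cypher, parameters) pair can then be handed to a database client, for example the official Python driver’s session.run(query, parameters).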

Researching Relationships

Now that we have a single sample working, let’s go ahead and load some additional samples into the graph database to identify other potential relationships or patterns between the various WannaCry specimens.

Compile Times

One common attribute of PE files we can quickly pivot on is the compile timestamp. When working with malware from the same family, you may be able to see a trend in compile times as newer variants are built and used in the wild. We can see a small clustering of these groups below:
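
The same pivot can be approximated outside the graph. As a small sketch, the Python below buckets samples by compile date. The function name and sample identifiers are hypothetical, and only the first timestamp comes from this post; the other two are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime


def group_by_compile_date(samples):
    """Map each compile date to the samples built on that day.

    `samples` is an iterable of (sample_id, "YYYY-MM-DD HH:MM:SS") pairs.
    """
    groups = defaultdict(list)
    for sample_id, timestamp in samples:
        compiled = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        groups[compiled.date().isoformat()].append(sample_id)
    return dict(groups)


compile_times = [
    ("sample_a", "2017-05-04 22:34:46"),  # timestamp quoted in this post
    ("sample_b", "2017-05-04 23:10:02"),  # invented for illustration
    ("sample_c", "2017-05-11 08:00:00"),  # invented for illustration
]

print(group_by_compile_date(compile_times))
# prints: {'2017-05-04': ['sample_a', 'sample_b'], '2017-05-11': ['sample_c']}
```

Keep in mind that compile timestamps can be forged, so a cluster is a lead to investigate rather than proof of a shared build.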


Another attribute we can use to visualize malware relationships is “detections”, which are produced as part of peframe’s output. The graph below outlines the 19 malware samples and the four main detections (i.e. mutex, antidbg, xor and packer).
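
In graph terms, each Detection node collects an edge from every sample that triggered it. A minimal Python sketch of that inverted index is shown here. The function name, sample identifiers and the detections assigned to each sample are made up for illustration and do not reflect the actual WannaCry results.

```python
from collections import defaultdict


def samples_by_detection(detections_per_sample):
    """Invert a sample -> detections mapping into detection -> sorted samples."""
    index = defaultdict(set)
    for sample_id, detections in detections_per_sample.items():
        for name in detections:
            index[name].add(sample_id)
    return {name: sorted(sample_ids) for name, sample_ids in index.items()}


# Invented data: which detections fired for which sample
observed = {
    "sample_a": ["mutex", "xor", "packer"],
    "sample_b": ["mutex", "xor"],
    "sample_c": ["packer"],
}

for name, sample_ids in sorted(samples_by_detection(observed).items()):
    print(name, sample_ids)
```

Samples that share most of their detections end up under the same keys, which is the same grouping the graph view makes visible at a glance.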

Going beyond static

For this post, we focused on only a handful of static attributes (dead code analysis, not running the malware). We could further enhance the attributes and relationships in the graph by including other data sources such as VirusTotal, Cuckoo Sandbox (dynamic analysis) and Xori output. As always, I hope this post was informative, and happy hunting!

Special Thanks

Thanks to @omgapt and @jeffochan7 for the assist on the post.
