Searching Code with a Graph Database

Some time ago, I was working on a server to generate images from weather RADAR data (a separate post on this will come at some point). As part of this, I spent a few hours profiling my code and found a tiny “bug” in the open source library I was using to parse one type of RADAR data.

To summarize, the library was doing

data := make([]uint8, ldm)
binary.Read(r, binary.BigEndian, &data)

when an equivalent but much faster way of doing this is

data := make([]uint8, ldm)
binary.Read(r, binary.BigEndian, data)

Notice the difference?

Passing &data instead of just data caused binary.Read to take nearly twice as long, and the function this was in was responsible for the vast majority of the request runtime. That one character decreased throughput by nearly 40%!

Aside: you may be wondering why passing a pointer to the slice is so much slower. It’s because Go’s binary.Read has a fast path for slices of fixed-size types, but a pointer to a slice falls through to a much slower reflect-based path.

Finding this got me thinking: this seems like something that could very easily slip into other code. Is there any existing way to quickly search over code when you find a bug or “anti-pattern” like this?

Existing Tools

There are three broad categories that existing tools seem to fit in:

  1. Highly advanced program analysis toolkits which build per-project databases (potentially compiling the project in the process), e.g. CodeQL and Joern
  2. AST-based matching tools
  3. “Simple” grep-like tools

None of these addresses everything that’s needed to do what I want, though. The program analysis toolkits, by their nature, don’t allow for easily querying for “all uses of symbol X across all repos” without consuming a ton of compute resources running the query on each project independently. The simpler AST tools may have enough information to run the query (if they do basic type deduction), but as far as I know they are all built as “interactive” tools that, once again, only run on one repo at a time and don’t index anything. Lastly, the purely textual tools can have indexes built, but they lack the type information required to check for the “argument is a pointer to a slice” part of our query.

I wanted something that has the intelligence of the first group (type information, data flow, import resolution, etc.) with the scalability of the third (tens of thousands of projects queryable in seconds).

So I wrote it! Introducing go-graph, because I’m bad at naming things.

Goals

Overview

For now, this project is only concerned with Go code. Why?

Now with that established, let’s talk about how it works.

Implementation

First, the schema. go-graph indexes packages, their functions, the statements within those functions, the variables those statements reference and assign, and the call sites connecting statements to the functions they call.

This is more than enough to be able to run the original motivating query, and even allows for some more complex queries like searching for program slices that match some criteria (as we’ll see later).

Storage

Next, I had to decide how to store all of this data. A graph database seemed natural given the, well, graph structure of all of this information.

You could say I’m building a source-graph, but that name was already taken :)

Additionally, graph query languages are specifically built to let us easily write “multi-hop” (or even fully recursive) queries, which is a necessity when moving between functions, their call sites, the statements for those call sites, the variables referenced in those statements, and so on. This could all be done in a relational DB, but a graph DB is a much more natural starting point.

So which graph database to use?

I had been wanting to try out JanusGraph for a while, so I did an initial implementation using it, mainly out of curiosity. It did work; however, indexing was somewhat bottlenecked (topping out at ~80k vertices or edges per second, and dipping as low as 20k/s), and the Go libraries for working with Gremlin (the query language JanusGraph uses) are not amazing.

As part of a series of optimizations I did after initial development, I was looking at Neo4j (as it’s more or less the industry standard, as far as I know), but ran across a forum post by Ben Klein, who was having issues doing large-scale bulk inserts.

I mention this in particular because the project that forum post discusses using Neo4j for happens to be WorldSyntaxTree. The goal of that project is, in short, to represent repositories, files, and parsed tree-sitter ASTs in a graph database - similar enough to what I’m doing that I figured if they were having issues with Neo4j, I probably would as well. Luckily for me, they had already tested and decided on another database, ArangoDB, so I followed in their footsteps.

Results

Indexing

After hacking on this on and off for a few weeks, then coming back a few months later to do the aforementioned performance optimizations, I had a working version that did everything I needed.

I shallow-cloned all Go repos on GitHub with >=100 stars (11,659 of them), giving me about 330GB of source, and started indexing. Before the optimization work, this took about 1.5 weeks to ingest; as of now, all 11k repos can be indexed in about 11 hours, resulting in an ArangoDB data directory of just under 200GB. For reference, the Go indexing code was running on an 8-core Xeon with 32GB of RAM, and the database was running in a VM on a machine with a 10-core i9, also with 32GB of RAM.

Some fun statistics from the ingest process:

Other Tools

I didn’t bother spending the time building CodeQL or Joern databases for each project, as I’m almost certain that would have taken longer than my indexing did, and I know the query times definitely wouldn’t satisfy the seconds-to-minutes requirement. Just starting/initializing either system takes multiple seconds; multiplied by tens of thousands of repos, that puts any queries well into the tens-of-minutes-to-hours range.

With those out, the only other tool that had all of the information needed to execute the original query was Semgrep. I let it rip over all of the code with this rule:

rules:
- id: untitled_rule
  pattern: |
      binary.Read(..., &($X : []$Y))
  message: Semgrep found a match
  languages: [go]
  severity: WARNING

aaaand it died:

$ time semgrep --metrics=off -c /tmp/semgrep_rule.yaml --verbose
...
====[ BEGIN error trace ]====
Raised at Stdlib__map.Make.find in file "map.ml", line 137, characters 10-25
Called from Sexplib0__Sexp_conv.Exn_converter.find_auto in file "src/sexp_conv.ml", line 156, characters 10-37
=====[ END error trace ]=====
...
3374.64s user 505.06s system 143% cpu 44:59.79 total

Running semgrep on each repo individually did work, however, yielding a total run time of about 2h15m:

$ time for d in *; do semgrep --metrics=off -c /tmp/semgrep_rule.yaml $d; done
...
8210.36s user 1246.15s system 115% cpu 2:16:59.63 total

It looks like some of semgrep’s initialization is single-threaded, so giving it the benefit of the doubt, we could expect ~13 minute query times if I had run 10 instances of this in parallel.

go-graph

I used the following Arango query to perform the search:

FOR p IN package
FILTER p.SourceURL == "encoding/binary"
FOR f IN OUTBOUND p Functions
FILTER f.Name == "Read"
FOR callsite IN INBOUND f Callee
FOR statement IN OUTBOUND callsite CallSiteStatement
FOR var IN OUTBOUND statement References
FILTER STARTS_WITH(var.Type, "[]")
FILTER CONTAINS(statement.Text, CONCAT("&", var.Name))
FOR callfunc IN INBOUND statement Statement
FOR callpkg IN INBOUND callfunc Functions
RETURN {package: callpkg.SourceURL, file: statement.File, text: statement.Text, var: var.Name, type: var.Type}

Writing this out in English:

  1. Take the encoding/binary package
  2. Traverse out to Functions named Read
  3. Traverse to the call sites of Read, and out to the statement in which the call occurred
  4. Traverse out to the variables referenced within that statement
  5. Filter for variables whose type starts with [] and which appear after a & in the raw statement text
  6. Traverse out from the call statement to the containing function, and then to the package containing said function
  7. Return the package, file, and statement in which the call happened, and the variable name and type which was passed incorrectly

And the performance?

…drumroll…

This query ran in only ~20s! Exactly in the timeframe I was looking for. Not quite “interactive” but also quick enough that you can iterate on queries without losing your train of thought.

Another one

One other potential use case that has interested me is using go-graph to do “poor man’s data flow analysis”, mainly to find examples of how you can go from one function/data type to another.

This is a somewhat contrived example, but let’s find all uses of crypto/rsa.GenerateKey where the result flows through 0 or more intermediary variables to be used in a pem.Encode call:

// find calls to crypto/rsa.GenerateKey
FOR p IN package
FILTER p.SourceURL == "crypto/rsa"
FOR f IN OUTBOUND p Functions
FILTER f.Name == "GenerateKey"
FOR call IN INBOUND f Callee
FOR srccallstmt IN OUTBOUND call CallSiteStatement

// Walk assign->ref->assign->ref->...
// until we reach a statement with an interesting call.
// v "alternates" between being a variable and being a statement

FOR v, e, path IN 1..5 OUTBOUND srccallstmt Assigns, INBOUND References
    PRUNE CONTAINS(v.Text, "Encode")
    OPTIONS {uniqueVertices: "path"}

// ensure that the end vertex is where we want
// quick check before doing any traversals
FILTER CONTAINS(v.Text, "Encode")
// now walk to the call site, called func,
// and ensure it's actually encoding/pem.Encode
FOR dstcallstmt IN INBOUND v CallSiteStatement
FOR dstcallfunc IN OUTBOUND dstcallstmt Callee
FILTER dstcallfunc.Name == "Encode"
FOR dstcallpkg IN INBOUND dstcallfunc Functions
FILTER dstcallpkg.SourceURL == "encoding/pem"

// ensure the "reference" is not actually an assignment
// go-graph considers a variable to be referenced
// even if it's on the left-hand side of an assignment
// which means `x, err := GenerateKey(); y, err := bar; Encode(y)`
// would match without this last filter since `err` is assigned
// in the first statement then also considered as "referenced"
// in the second
FILTER LENGTH(
    FOR stmt IN path.vertices
    FILTER IS_SAME_COLLECTION(statement, stmt)
    FOR checkassign IN OUTBOUND stmt Assigns
    FOR target IN path.vertices
    FILTER IS_SAME_COLLECTION(variable, target)
    FILTER POSITION(path.vertices, target, true) < POSITION(path.vertices, stmt, true)
    FILTER target == checkassign
    RETURN checkassign
) == 0
RETURN path

This query might be more complex than you’d expect, but despite that it runs in a relatively timely fashion: 1,000 results take 2 seconds for up to 1 intermediate variable, 8 seconds for up to 2 intermediates, and 30 seconds for up to 3. For example, it returned this code in Gitea (which is where I originally got the idea to test the flow from crypto/rsa.GenerateKey to pem.Encode):

priv, err = rsa.GenerateKey(rand.Reader, c.Int("rsa-bits"))
...
derBytes, err := x509.CreateCertificate(rand.Reader, &template, &template, publicKey(priv), priv)
...
err = pem.Encode(certOut, &pem.Block{Type: "CERTIFICATE", Bytes: derBytes})

But it also found a longer chain in a kubernetes cert utility:

caKey, err := rsa.GenerateKey(cryptorand.Reader, 2048)
...
caDERBytes, err := x509.CreateCertificate(cryptorand.Reader, &caTemplate, &caTemplate, &caKey.PublicKey, caKey)
...
caCertificate, err := x509.ParseCertificate(caDERBytes)
...
derBytes, err := x509.CreateCertificate(cryptorand.Reader, &template, caCertificate, &priv.PublicKey, caKey)
...
err := pem.Encode(&certBuffer, &pem.Block{Type: CertificateBlockType, Bytes: derBytes})

This ability could help you when you have a data source (e.g. GenerateKey) and know where the data needs to end up (e.g. written to a PEM file), but can’t find any examples of what function(s) are needed to convert between (x509.CreateCertificate in this case).

Improvements

go-graph currently doesn’t keep a graph of the full AST, so there’s no “pure graph” way to find when a reference to a variable is taken. It wouldn’t be too hard to add, but I knew the trick of looking for &{variableName} would work, so I didn’t bother implementing it for now. Similarly, the position/context of a variable reference is not saved anywhere, so (barring more string comparisons) you’re unable to specify something like “&x is the 3rd argument”.

There are also “semantically equivalent” variants of the first query which aren’t captured. For instance:

data := make([]uint8, ldm)
data2 := &data
binary.Read(r, binary.BigEndian, data2)

For the original motivating query it doesn’t really matter, as the mistake is going to be adding a & into the binary.Read call, not passing a pre-existing variable of type *[]whatever. Of course, the query could be changed to look for that case as well if desired.

Closing Words

This level of search definitely isn’t for every use case, but it fits nicely into a slot that I haven’t seen any other project fill. There are a lot of details in the implementation not covered here, but they’re not really relevant to the overarching goal; I encourage you to go check out the code if you’re interested in the nitty-gritty. I’m also happy to provide tarballs of the cloned repos and/or the indexed database if anyone would like to experiment without having to clone and index everything themselves.

I hope you enjoyed reading! As always, feel free to drop me an email with any questions, suggestions, or other ideas.