Using Graphs to Search for Code
02 Jul 2022
Tagged: programming devtools

Some time ago, I was working on a server to generate images from weather RADAR data (a separate post on this will come at some point). As part of this, I spent a few hours profiling my code and found a tiny “bug” in the open source library I was using to parse one type of RADAR data.
To summarize, the library was doing
data := make([]uint8, ldm)
binary.Read(r, binary.BigEndian, &data)
when an equivalent but much faster way of doing this is
data := make([]uint8, ldm)
binary.Read(r, binary.BigEndian, data)
Notice the difference?
Passing &data instead of just data caused binary.Read to take nearly twice as long, and the function this was in was responsible for the vast majority of the request runtime. That one character decreased throughput by nearly 40%!
Aside: you may be wondering, why is passing a pointer to the slice so much slower? It’s because Go’s binary.Read has a fast path for slices of fixed-size types, but a pointer to a slice ends up taking a much slower reflect-based path.
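If you’re curious, the gap is easy to reproduce with a quick benchmark. The following is my own sketch (not code from the RADAR library); drop it into a file ending in _test.go and run go test -bench=. to compare the two call forms:

package binaryreadbench

import (
	"bytes"
	"encoding/binary"
	"testing"
)

// benchRead reads 1 MiB of big-endian bytes into dst, which is either the
// slice itself or a pointer to it.
func benchRead(b *testing.B, dst interface{}) {
	buf := bytes.Repeat([]byte{0xAB}, 1<<20)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if err := binary.Read(bytes.NewReader(buf), binary.BigEndian, dst); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkReadSlice(b *testing.B) {
	data := make([]uint8, 1<<20)
	benchRead(b, data) // data itself: hits binary.Read's []uint8 fast path
}

func BenchmarkReadSlicePointer(b *testing.B) {
	data := make([]uint8, 1<<20)
	benchRead(b, &data) // &data: falls back to the slower reflect-based path
}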
Finding this got me thinking: this seems like something that could very easily slip into other code. Is there any existing way to quickly search over code when you find a bug or “anti-pattern” like this?
Existing Tools
There’s three broad categories that existing tools seem to fit in:
- Highly advanced program analysis toolkits which build per-project databases (potentially compiling them in the process):
- AST-based matching tools
- “Simple” grep-like tools
- Sourcegraph
grep
None of these address everything that’s needed to do what I want though. The program analysis toolkits by their nature don’t allow for easy querying for “all uses of symbol X across all repos” without consuming a ton of compute resources running the query on each project independently. Similarly, the simple AST tools may have enough information to run the query (if they do basic type deduction), but as far as I know, they are all built as “interactive” tools that once again only run on one repo at a time and don’t index anything. Lastly, the purely textual tools can have indexes built, but they don’t have the type information required to check for the “argument is a pointer to a slice” part of our query.
I wanted something that has the intelligence of the first group (type information, data flow, import resolution, etc.) with the scalability of the last (tens of thousands of projects queryable in seconds).
So I wrote it! Introducing go-graph, because I’m bad at naming things.
Goals
Overview
For now, this project is only concerned with Go code. Why?
- Go has a large open source community, so there’s plenty of code to search
- Go is a simple language (for good or for bad), which makes our AST parsing easy
- Go also ships with all of the libraries we need to parse and analyze Go code because the compiler is self-hosted (see the short sketch after this list)
- Go’s lack of a preprocessor means that a specific file in a package can only ever be compiled one way - there’s no chance for an #ifdef or similar to change the file
- Lastly, but perhaps most importantly: the motivating bug was in Go code and I wanted to keep the scope of this project relatively tight since there’s basically no upper bound on how complex it could get
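As a quick illustration of that standard-library tooling (a minimal sketch, not go-graph’s actual indexing code): go/parser builds the AST and go/types resolves the type of every definition, which is exactly the kind of information an index like this needs.

package main

import (
	"fmt"
	"go/ast"
	"go/importer"
	"go/parser"
	"go/token"
	"go/types"
)

func main() {
	src := `package demo

func f(n int) {
	data := make([]uint8, n)
	_ = data
}`

	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		panic(err)
	}

	// Type-check the parsed file so we can ask for the type of each definition.
	info := &types.Info{Defs: map[*ast.Ident]types.Object{}}
	conf := types.Config{Importer: importer.Default()}
	if _, err := conf.Check("demo", fset, []*ast.File{file}, info); err != nil {
		panic(err)
	}

	for ident, obj := range info.Defs {
		if obj != nil {
			fmt.Printf("%s: %s has type %s\n", fset.Position(ident.Pos()), ident.Name, obj.Type())
		}
	}
}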
Now with that established, let’s talk about how it works.
Implementation
First, the schema. go-graph indexes:
- Metadata (source URL, and version) about each indexed Go package
- The functions in each package
- The statements in each function, their raw source text, and their successor(s) (i.e. the control flow graph)
- Any function calls which happened in a statement, with the resolved target(s) of the call
- Variables (name and type), as well as relations from each variable to where it’s defined and to all statements in which it’s referenced
This is more than enough to be able to run the original motivating query, and even allows for some more complex queries like searching for program slices that match some criteria (as we’ll see later).
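To make that concrete, here’s roughly how I picture the vertex and edge collections fitting together. The field and edge names below come from the queries later in this post, but the exact layout is my own sketch rather than go-graph’s actual definitions:

package schema

// Vertex document types. Collection names "package", "statement", and
// "variable" appear in the queries below; the rest is a guess.
type Package struct {
	SourceURL string // e.g. "github.com/foo/bar" or "encoding/binary"
	Version   string
}

type Function struct {
	Name string
}

type Statement struct {
	File string // source file containing the statement
	Text string // raw source text of the statement
}

type Variable struct {
	Name string
	Type string // e.g. "[]uint8"
}

// Edge collections (direction as used in the queries below):
//   Functions:         package   -> function
//   Statement:         function  -> statement
//   Callee:            call site -> function (resolved call target)
//   CallSiteStatement: call site -> statement
//   References:        statement -> variable
//   Assigns:           statement -> variable
// plus the statement "next" edges that form the control flow graph.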
Storage
Next, I had to decide how to store all of this data. A graph database seemed natural given the, well, graph structure of all of this information.
You could say I’m building a source-graph, but that name was already taken :)
Additionally, graph query languages are specifically built to let us easily write “multi-hop” (or even fully recursive) queries, which is a necessity when moving between functions, their call sites, the statements for those call sites, the variables referenced in those statements, etc. This could all be done in a relational DB, but a graph DB is a much more natural starting point.
So which graph database to use?
I had been wanting to try out Janusgraph for a while, so I did an initial implementation using it mainly out of curiosity. It did work; however, indexing was somewhat bottlenecked (topping out at ~80k vertices or edges per second, dipping as low as 20k/s), and the Go libraries for working with Gremlin queries (the language Janusgraph uses) are not amazing.
As part of a series of optimizations I did after initial development, I was looking at Neo4j (as it’s more or less the industry standard as far as I know), but ran across a forum post by Ben Klein who was having issues with doing large-scale bulk inserts.
I mention this in particular because the project that forum post describes using Neo4j for just so happens to be WorldSyntaxTree. The goal of that project is, in short, representing repositories, files, and parsed tree-sitter ASTs in a graph database - similar enough to what I’m doing that I figured if they were having issues with Neo4j, I probably would as well. Luckily for me, they had already tested and decided on another database, ArangoDB, so I followed in their footsteps.
Results
Indexing
After hacking on this on and off for a few weeks then coming back a few months later to do the aforementioned performance optimizations, I had a working version which did everything I needed.
I shallow cloned all Go repos on GitHub with >=100 stars (11,659 of them) giving me about 330GB of source, and started indexing. Before the optimization work, this took about 1.5 weeks to ingest, but as of now, all 11k repos can be indexed in about 11 hours, resulting in an ArangoDB data directory of just under 200GB. For reference, the Go indexing code was running on an 8 core Xeon with 32GB of RAM, and the database was running in a VM on a machine with a 10 core i9, also with 32GB of RAM.
Some fun statistics from the ingest process:
- Network traffic to the database averaged about 50Mbps, peaking at nearly 200Mbps
- After all was done, the database had:
- 219k distinct packages, 420k unique (package,version) pairs
- 27M functions
- 450M function calls
- 190M statements
- 126M variables
- 91M variable assigns
- 360M variable references
- 185M statement “next” edges
Other Tools
I didn’t bother spending the time building CodeQL or Joern databases for each project, as I’m almost certain that would have taken longer than my indexing did, and I know the query times definitely wouldn’t satisfy the seconds-to-minutes requirement. Just starting/initializing either system takes multiple seconds; multiplied across tens of thousands of repos, that puts any query well into the tens-of-minutes-to-hours range.
With those ruled out, the only other tool that had all of the information needed to execute the original query was Semgrep. I let it rip over all of the code with this rule:
rules:
  - id: untitled_rule
    pattern: |
      binary.Read(..., &($X : []$Y))
    message: Semgrep found a match
    languages: [go]
    severity: WARNING
aaaandd it died:
$ time semgrep --metrics=off -c /tmp/semgrep_rule.yaml --verbose
...
====[ BEGIN error trace ]====
Raised at Stdlib__map.Make.find in file "map.ml", line 137, characters 10-25
Called from Sexplib0__Sexp_conv.Exn_converter.find_auto in file "src/sexp_conv.ml", line 156, characters 10-37
=====[ END error trace ]=====
...
3374.64s user 505.06s system 143% cpu 44:59.79 total
Running semgrep on each repo individually did work, however, yielding a run time of about 2h15m:
$ time for d in *; do semgrep --metrics=off -c /tmp/semgrep_rule.yaml $d; done
...
8210.36s user 1246.15s system 115% cpu 2:16:59.63 total
It looks like some of semgrep’s initialization is single-threaded, so giving it the benefit of the doubt, we could expect ~13 minute query times if I had run 10 instances of this in parallel.
go-graph
I used the following Arango query to perform the search:
FOR p IN package
FILTER p.SourceURL == "encoding/binary"
FOR f IN OUTBOUND p Functions
FILTER f.Name == "Read"
FOR callsite IN INBOUND f Callee
FOR statement IN OUTBOUND callsite CallSiteStatement
FOR var IN OUTBOUND statement References
FILTER STARTS_WITH(var.Type, "[]")
FILTER CONTAINS(statement.Text, CONCAT("&", var.Name))
FOR callfunc IN INBOUND statement Statement
FOR callpkg IN INBOUND callfunc Functions
RETURN {package: callpkg.SourceURL, file: statement.File, text: statement.Text, var: var.Name, type: var.Type}
Writing this out in English:
- Take the encoding/binary package
- Traverse out to Functions named Read
- Traverse to the call sites of Read, and out to the statement in which the call occurred
- Traverse out to the variable referenced within that statement
- Filter for variables whose type starts with [] and which appear after a & in the raw statement text
- Traverse out from the call statement to the containing function, and then to the package containing said function
- Return the package, file, and statement in which the call happened, and the variable name and type which was passed incorrectly
And the performance?
…drumroll…
This query ran in only ~20s! Exactly in the timeframe I was looking for. Not quite “interactive” but also quick enough that you can iterate on queries without losing your train of thought.
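If you’d rather poke at a database like this from code than from the Arango web UI, here’s a minimal sketch of issuing an AQL query with the official ArangoDB Go driver. The endpoint, database name, and the trivial query are placeholders, not go-graph’s actual values:

package main

import (
	"context"
	"fmt"

	driver "github.com/arangodb/go-driver"
	"github.com/arangodb/go-driver/http"
)

func main() {
	ctx := context.Background()

	// Connect to a local ArangoDB instance (placeholder endpoint).
	conn, err := http.NewConnection(http.ConnectionConfig{
		Endpoints: []string{"http://localhost:8529"},
	})
	if err != nil {
		panic(err)
	}
	client, err := driver.NewClient(driver.ClientConfig{Connection: conn})
	if err != nil {
		panic(err)
	}
	db, err := client.Database(ctx, "gograph") // database name is a placeholder
	if err != nil {
		panic(err)
	}

	// Any AQL works here, including the big traversal above.
	query := `FOR p IN package FILTER p.SourceURL == @pkg RETURN p`
	cursor, err := db.Query(ctx, query, map[string]interface{}{"pkg": "encoding/binary"})
	if err != nil {
		panic(err)
	}
	defer cursor.Close()

	for cursor.HasMore() {
		var doc map[string]interface{}
		if _, err := cursor.ReadDocument(ctx, &doc); err != nil {
			panic(err)
		}
		fmt.Println(doc)
	}
}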
Another one
One other potential use case that has interested me is using go-graph to do “poor man’s data flow analysis”, mainly to find examples of how you can get from one function/data type to another.
This is a somewhat contrived example, but let’s find all uses of crypto/rsa.GenerateKey where the result flows through 0 or more intermediary variables before being used in a pem.Encode call:
// find calls to crypto/rsa.GenerateKey
FOR p IN package
FILTER p.SourceURL == "crypto/rsa"
FOR f IN OUTBOUND p Functions
FILTER f.Name == "GenerateKey"
FOR call IN INBOUND f Callee
FOR srccallstmt IN OUTBOUND call CallSiteStatement
// Walk assign->ref->assign->ref->...
// until we reach a statement with an interesting call.
// v "alternates" between being a variable and being a statement
FOR v, e, path IN 1..5 OUTBOUND srccallstmt Assigns, INBOUND References
PRUNE CONTAINS(v.Text, "Encode")
OPTIONS {uniqueVertices: "path"}
// ensure that the end vertex is where we want
// quick check before doing any traversals
FILTER CONTAINS(v.Text, "Encode")
// now walk to the call site, called func,
// and ensure it's actually encoding/pem.Encode
FOR dstcallstmt IN INBOUND v CallSiteStatement
FOR dstcallfunc IN OUTBOUND dstcallstmt Callee
FILTER dstcallfunc.Name == "Encode"
FOR dstcallpkg IN INBOUND dstcallfunc Functions
FILTER dstcallpkg.SourceURL == "encoding/pem"
// ensure the "reference" is not actually an assignment
// go-graph considers a variable to be referenced
// even if it's on the left-hand side of an assignment
// which means `x, err := GenerateKey(); y, err := bar; Encode(y)`
// would match without this last filter since `err` is assigned
// in the first statement then also considered as "referenced"
// in the second
FILTER LENGTH(
FOR stmt IN path.vertices
FILTER IS_SAME_COLLECTION(statement, stmt)
FOR checkassign IN OUTBOUND stmt Assigns
FOR target IN path.vertices
FILTER IS_SAME_COLLECTION(variable, target)
FILTER POSITION(path.vertices, target, true) < POSITION(path.vertices, stmt, true)
FILTER target == checkassign
RETURN checkassign
) == 0
RETURN path
This query might be more complex than you’d expect, but despite that it runs in a relatively timely fashion: 1000 results take 2 seconds for up to 1 intermediate variable, 8 seconds for up to 2 intermediates, and 30 seconds for up to 3. For example, it returned this code in Gitea (where I originally got the idea to test the path from crypto/rsa.GenerateKey to pem.Encode):
priv, err = rsa.GenerateKey(rand.Reader, c.Int("rsa-bits"))
...
derBytes, err := x509.CreateCertificate(rand.Reader, &template, &template, publicKey(priv), priv)
...
err = pem.Encode(certOut, &pem.Block{Type: "CERTIFICATE", Bytes: derBytes})
But it also found a longer chain in a Kubernetes cert utility:
caKey, err := rsa.GenerateKey(cryptorand.Reader, 2048)
...
caDERBytes, err := x509.CreateCertificate(cryptorand.Reader, &caTemplate, &caTemplate, &caKey.PublicKey, caKey)
...
caCertificate, err := x509.ParseCertificate(caDERBytes)
...
derBytes, err := x509.CreateCertificate(cryptorand.Reader, &template, caCertificate, &priv.PublicKey, caKey)
...
err := pem.Encode(&certBuffer, &pem.Block{Type: CertificateBlockType, Bytes: derBytes})
This ability could help when you have a data source (e.g. GenerateKey) and know where the data needs to end up (e.g. written to a PEM file), but can’t find any examples of which function(s) are needed to convert between the two (x509.CreateCertificate in this case).
Improvements
go-graph doesn’t currently keep a graph of the full AST, so there’s no “pure graph” way to find when the address of a variable is taken. It wouldn’t be too hard to add; however, I knew the trick of looking for &{variableName} in the statement text would work, so I didn’t bother implementing it for now. Similarly, the position/context of a variable reference is not saved anywhere, so (barring more string comparisons) you’re unable to specify something like “&x is the 3rd argument”.
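For reference, finding address-of expressions in the AST is straightforward with the standard go/ast package; something along these lines (my own sketch, not code that exists in go-graph) would remove the need for the text-matching trick:

package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

func main() {
	src := `package demo

func read(ldm int) []uint8 {
	data := make([]uint8, ldm)
	binary.Read(r, binary.BigEndian, &data) // the motivating mistake
	return data
}`

	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		panic(err)
	}

	// Walk the AST and report every &variable expression.
	ast.Inspect(file, func(n ast.Node) bool {
		if u, ok := n.(*ast.UnaryExpr); ok && u.Op == token.AND {
			if id, ok := u.X.(*ast.Ident); ok {
				fmt.Printf("%s: address taken of %q\n", fset.Position(u.Pos()), id.Name)
			}
		}
		return true
	})
}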
There are also “semantically equivalent” variants of the first query which aren’t captured. For instance:
data := make([]uint8, ldm)
data2 := &data
binary.Read(r, binary.BigEndian, data2)
For the original motivating query though, it doesn’t really matter, as the mistake is going to be adding a & into the binary.Read call, not passing a pre-existing variable of type *[]whatever. It’s also entirely possible to change the query to look for that case if desired.
Closing Words
This level of search definitely isn’t for every use case, but it does fit nicely into a slot that I haven’t seen any other project fill. There are a lot of details in the implementation not covered here, but they’re not really relevant to the overarching goal; I encourage you to check out the code if you’re interested in the nitty-gritty. I’m also happy to provide tarballs of the cloned repos and/or the indexed database if anyone would like to experiment without having to clone and index everything themselves.
I hope you enjoyed reading! As always, feel free to drop me an email with any questions, suggestions, or other ideas.