<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.7">Jekyll</generator><link href="https://www.nickgregory.me/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.nickgregory.me/" rel="alternate" type="text/html" /><updated>2025-03-23T16:14:20+00:00</updated><id>https://www.nickgregory.me/feed.xml</id><title type="html">Nick Gregory</title><author><name>Nick Gregory</name><email>nick@nickgregory.me</email></author><entry><title type="html">Bismuth: Building a Cloud</title><link href="https://www.nickgregory.me/post/2024/06/29/bismuth/" rel="alternate" type="text/html" title="Bismuth: Building a Cloud" /><published>2024-06-29T00:00:00+00:00</published><updated>2024-06-29T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2024/06/29/bismuth</id><content type="html" xml:base="https://www.nickgregory.me/post/2024/06/29/bismuth/"><![CDATA[<p>This year, a <a href="https://x.com/kinglycrow">good friend</a> and I have been working on a new startup: <a href="https://www.bismuth.cloud/">Bismuth</a>.</p>

<p>We’re building a cloud - not like AWS or GCP, which have hundreds of services to pick from, but an opinionated platform with the essentials built in, designed to make it fast and easy to go from 0 to 100.
There’s no setting up a secrets manager, no configuring CloudTrail, not even creating S3 buckets. Just one click to deploy your code (potentially written by our LLM), and you have implicit access to configuration &amp; secrets, a K/V service, and blob storage, with a couple more services coming soon.</p>

<p>I know I’ve had many times where I just want to put some code up somewhere and have it run, but every time I either ran into the aforementioned cloud “feature” of needing to set up four different services to serve “Hello World” in a Lambda, or had to spend hours wrangling CI/CD to get Docker images built, stored somewhere, and finally actually running. And that’s all before we get into needing fancy things like <em>databases</em> :)</p>

<p>Please check out the <a href="https://www.bismuth.cloud/">website</a> and the <a href="https://www.bismuth.cloud/blog">blog</a> (where we have some really good technical articles planned about the process of building Bismuth and its internals), or <a href="https://app.bismuth.cloud/">give the platform a go</a>. Let us know what you think!</p>

<h1 id="what-is-time-travel-debugging">What is time travel debugging?</h1>

<p>Time travel debuggers (also called record/replay debuggers) add an extra “dimension” to normal debuggers - the ability to go back in time in a debugging session. This makes it significantly easier to determine <em>causality</em> - what first triggered the bug in question. In some cases, this may be very obvious (e.g. an inverted logic condition), but in others the underlying trigger may have happened long ago, in a different context, or even on a different thread entirely.</p>

<p>The original motivation to create Warpspeed came from the pain of trying to debug a very rare crash in <a href="https://github.com/google/santa">Santa</a>. At seemingly random times, one of Santa’s non-critical processes would end up in what seemed like a physically impossible state. It took <em>weeks</em> of looking at this code on and off until we eventually figured out what was going on: an Objective-C callback was writing into a stack-local boolean whose frame might no longer be valid by the time the callback ran. In most cases, the bug didn’t trigger (since the write only happened on an error path), and even when the error path was hit, the <code class="highlighter-rouge">*bool = true</code> may have been completely innocuous, flipping unused or unimportant bits depending on what other threads were running. But if a thread was running just the right thing when this callback was processed, the bool would be written into the stack of a now-different thread, corrupting it and causing the crashes we were seeing. These are exactly the types of bugs a time travel debugger makes dramatically easier to track down.</p>

<h2 id="prior-work">Prior Work</h2>

<p>There have been time travel debuggers for decades at this point - it’s not a new idea. If you’re a Windows developer, you may have even used one before - WinDbg has had time travel debugging since 2017. On Linux, <a href="https://rr-project.org/">rr</a> (which stemmed out of Mozilla) is probably the most well known, but there’s a handful more including <a href="https://github.com/dettrace/dettrace">DetTrace</a>, more recently <a href="https://github.com/facebookexperimental/hermit">Hermit</a>, and even GDB, though GDB’s built-in TTD has been… lacking… in my experience.</p>

<p>Windows/WinDbg has first-party magic involved to make it work (I’m actually not sure about the details of this ¯\_(ツ)_/¯), but the primary mechanism for all of the Linux debuggers above is the <code class="highlighter-rouge">ptrace</code> subsystem, which provides easy syscall interception plus trapping of non-deterministic CPU instructions (e.g. <code class="highlighter-rouge">rdtsc</code>). This, combined with the ability to set thread affinity (preventing threads from running and/or forcing serialization) and hardware performance counters (to measure and reproduce events coming into the program from the outside world), makes it possible to build record/replay debuggers on Linux.</p>

<p>So what about macOS?</p>

<h1 id="why-would-we-do-this-to-ourselves">Why would we do this to ourselves?</h1>

<p>macOS has a <a href="http://uninformed.org/index.cgi?v=4&amp;a=3&amp;p=14">documented history</a> of lackluster <code class="highlighter-rouge">ptrace</code> support - and by that I mean an almost non-existent implementation. It also doesn’t have any way to pin processes to specific cores, has limited PMC facilities, etc. Not off to a great start.</p>

<p>macOS (and the BSDs in general) <em>do</em> have <code class="highlighter-rouge">dtrace</code> however. We originally experimented with using <code class="highlighter-rouge">dtrace</code> to hook syscalls and traps, and got pretty far with this, but eventually realized it wasn’t going to be the final answer for a couple of reasons. We either had to have the dtrace program send a POSIX signal to the target (using <code class="highlighter-rouge">raise()</code>) and intercept that (using <code class="highlighter-rouge">waitpid</code> or similar), or use another dtrace call to suspend the mach task (using <code class="highlighter-rouge">stop()</code>) and poll for that change from another process, since there’s no way to be notified of a mach suspend happening. Both of these methods also only freeze the program <em>after</em> the syscall has run, meaning any collection of pre-syscall memory state would need to be done by the dtrace program itself. Lastly, dtrace didn’t provide a way to force serialization (controlling thread preemption), so we were stuck.</p>

<p>We then played around with the idea of pure userland interception. The main realization here is that the macOS ABI isn’t at the syscall layer like it is on Linux, but rather at the <code class="highlighter-rouge">libSystem</code> layer - a dylib just like any other. Hooks could be added to <code class="highlighter-rouge">libSystem_kernel</code> to intercept everything in userland (letting us gather any program state we could possibly need), but we still had issues with things like threading. We could intercept the various XNU thread creation calls and implement our own scheduler, but we still wouldn’t have visibility into early process start, making some things non-deterministic before we got a chance to do anything (e.g. malloc entropy, the pointer munge value, etc.)</p>

<p>Eventually after trying to figure out how else we could limit concurrency, we had another thought: what if we put the userland application into a VM? By definition the guest would <em>have</em> to trap out to receive any data which could be non-deterministic. This lets us put the entire app “in a box” and use normal VMM facilities to intercept and log events from the app. This also means we can completely control scheduling by just swapping vCPU register state between threads as is done normally by the kernel.</p>

<h1 id="late-nights-hacking">Late Night(s) Hacking</h1>

<p>At first glance this might appear to be a pretty simple thing to do - you just need to load the Mach-O into the VM’s physical address space and set up a trap to exit to the hypervisor when a syscall is performed, right? We even had a lot of reference code from the <a href="https://www.darlinghq.org/">Darling Project</a> implementing the loader, commpage, etc.! Well…</p>

<ul>
  <li>We also need to load the dyld shared cache
    <ul>
      <li>Can we just map the running cache into the VM?
        <ul>
          <li>No - it’s not allowed by the kernel</li>
        </ul>
      </li>
      <li>Can we copy the running cache?
        <ul>
          <li>No, because some globals are already initialized, which confuses/breaks dyld’s init</li>
        </ul>
      </li>
      <li>Can we use <code class="highlighter-rouge">DYLD_SHARED_CACHE_DIR</code> to have the guest dyld try to map a new cache itself?
        <ul>
          <li>No, because it basically just asks the kernel, which says there’s already one in memory</li>
        </ul>
      </li>
      <li>Can we unpack the cache and ask dyld to load individual dylibs?
        <ul>
          <li>Yes, but it doesn’t work because of dylib fixups</li>
        </ul>
      </li>
      <li>Time to go implement a DSC loader…</li>
    </ul>
  </li>
  <li>… and we fault on some memory write in dyld (on an atomic, I believe)
    <ul>
      <li>The ARM memory model requires virtual memory/page tables to be set up so the CPU knows how to treat the memory (i.e. (non)-gathering, (non)-reorderable, (non)-early write ack).</li>
      <li>Great, now go implement page tables</li>
      <li>Debugged an issue for 2 days due to a missing bit causing faults</li>
      <li>Eventually switched to using a lightly modified version of <a href="https://github.com/Impalabs/hyperpom">Hyperpom</a> which dealt with the page table management
        <ul>
          <li>And in the process rewrote the rest of the loading code in Rust (instead of the C it was in before)</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>And finally, we need to set up a few pages for thread-local storage</li>
</ul>

<p>I’m sure I’m missing stuff recounting this many months after the fact, but the point is: <em>it was not easy</em>.</p>

<p>After a couple weeks, I had gone through the above and had things mostly working. Now just to add recording of state before syscalls, after syscalls, persisting that to disk, and injecting it back in during a replay session…</p>

<p>In a time crunch (with only a few weeks until we were due to speak), we had another idea to keep implementation time down: instead of implementing semantics for each syscall and mach trap that macOS has, we can just “diff” memory before and after the host kernel handles a syscall and use that to automate most of the state recording. Since programs really only have two ways to get data (reading it out of memory or performing a syscall), and all memory reads/writes inside the VM are deterministic, we only have to special-case the things that can modify what memory is valid (e.g. <code class="highlighter-rouge">mmap</code>) - changes made to memory by syscalls are recorded and faked during replay. Since Warpspeed must already special-case syscalls which modify the memory maps (to map/unmap in the VM), it can recursively check whether any register or memory value contains a valid pointer - a similar technique to how some garbage collectors work. For any addresses discovered, we snapshot the surrounding memory before the syscall and record the diff against the memory after the syscall was processed.</p>

<p>With less than a day to spare, still writing code at the airport about to fly up to Montreal for the conference, this all finally came together for the demo: <a href="https://www.youtube.com/watch?v=Td5cQ6kGP5g">https://www.youtube.com/watch?v=Td5cQ6kGP5g</a>.</p>

<h1 id="epilogue">Epilogue</h1>

<p>After the talk, I refactored and cleaned up the VM management code and split it into a separate crate called <a href="https://github.com/kallsyms/appbox">AppBox</a>. It contains the main logic for “putting apps in a box,” allowing a number of other tools to be built on top. For example, one could implement something like <a href="https://gvisor.dev/">gVisor</a> for macOS using this library, limiting access to, or emulating, bits of XNU. This division makes the entire Warpspeed record/replay system just an AppBox trap handler, responsible for the aforementioned pointer chasing, discovery, and recording of changed memory during syscalls.</p>

<p>If you want to play around with Warpspeed, clone <a href="https://github.com/kallsyms/warpspeed">the repo</a> and run <code class="highlighter-rouge">make</code> to build and sign the CLI. You’ll need Rust nightly installed. Then:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./target/release/warpspeed record -vvvv /tmp/trace /bin/ls -l
</code></pre></div></div>

<p>to record an execution of <code class="highlighter-rouge">/bin/ls -l</code> to the trace file <code class="highlighter-rouge">/tmp/trace</code>. Finally, replay it with</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./target/release/warpspeed replay -vvvv /tmp/trace
</code></pre></div></div>

<p>And if all worked, you should be able to remove files, change directories, or do basically anything and still see the same output on replay.</p>

<h2 id="the-future">The Future</h2>

<p>There’s still plenty to do with Warpspeed - by no means is it production ready or even super useful. Replay of thread switching and external signals still needs to be implemented (along with all of the infrastructure to measure and break when some number of instructions has been run) and there are unanswered questions about how this should interact with the graphical components of macOS. However there should not be any technical limitations making this impossible, only a matter of (substantial) development time. Of which we have little right now 🙂</p>]]></content><author><name>Nick Gregory, Pete Markowsky</name></author><category term="security" /><category term="macos" /><summary type="html"><![CDATA[This is a (long) overdue post to accompany the REcon 2023 talk Pete Markowsky and I gave talking about our work on Warpspeed: a time travel debugger for macOS.]]></summary></entry><entry><title type="html">Improving Fuzzing Speed with userfaultfd</title><link href="https://www.nickgregory.me/post/2022/12/09/uffd-fuzz/" rel="alternate" type="text/html" title="Improving Fuzzing Speed with userfaultfd" /><published>2022-12-09T00:00:00+00:00</published><updated>2022-12-09T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2022/12/09/uffd-fuzz</id><content type="html" xml:base="https://www.nickgregory.me/post/2022/12/09/uffd-fuzz/"><![CDATA[<p>About the same time I wrote up my <a href="https://nickgregory.me/post/2021/12/10/afl-kmod/">previous post about snapshot fuzzing</a>, I was thinking about other ways to restore program state for fuzzing, ideally in userland for ease of use.</p>

<p>There are of course many program side effects that need to be accounted for to restore program state perfectly: threads, files, timers, etc.
However, those all interact with the kernel in ways that can be intercepted with either libc hooks or syscall (seccomp) hacks.
Better yet, for the purposes of fuzzing they can often be disregarded and cleaned up in bulk after some large number of runs - for example, a few extra open files shouldn’t break well-written programs.</p>

<p>The biggest challenge is restoring memory state since there’s no easy way to determine what memory has changed between runs from userland.
The kernel can do this without too much effort (see the previous post), but this information isn’t easily accessible to userland.
You could duplicate and restore <em>all</em> memory regions; however, this doubles the running memory overhead of any fuzz target, since it has to keep a pristine copy as well as the working copy. It may also take a significant amount of time to reset between runs if there are, say, hundreds of megabytes of shared libraries to restore.</p>

<p>While looking at something entirely unrelated, I had the idea to use <a href="https://man7.org/linux/man-pages/man2/userfaultfd.2.html"><code class="highlighter-rouge">userfaultfd</code></a> to do this memory dirtiness tracking, which could then be restored at page granularity after the program finished running.</p>

<h1 id="userfaultfd">userfaultfd?</h1>

<p>In short, <a href="https://man7.org/linux/man-pages/man2/userfaultfd.2.html"><code class="highlighter-rouge">userfaultfd</code></a> is a newer Linux-specific interface for <em>user</em>land page<em>fault</em> handlers.
Instead of having a single SIGSEGV handler and tweaking memory protections, <code class="highlighter-rouge">userfaultfd</code> allows memory to be registered to a <code class="highlighter-rouge">userfaultfd</code> object which is then polled by another thread (or even another process) to respond to those faults.
It’s a much more flexible and performant way of handling page faults compared to a signal handler, and is perfect for our needs (except for a few small hindrances which can be worked around).</p>

<h1 id="implementation">Implementation</h1>

<p>To test the viability of this, I first created a minimal proof of concept which:</p>

<ol>
  <li>Duplicates all program memory (besides the relocation stub itself) into anonymous pages since <code class="highlighter-rouge">userfaultfd</code> cannot hook file-backed pages.</li>
  <li>Registers each writeable (now anonymous) page with a <code class="highlighter-rouge">userfaultfd</code> object in <a href="https://man7.org/linux/man-pages/man2/userfaultfd.2.html#:~:text=an%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20UFFDIO_ZEROPAGE%20ioctl.-,UFFDIO_REGISTER_MODE_WP,-(since%205.7)%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20When">write-protect mode</a>. When one of these pages is written to, its address and contents are added to a simple statically allocated array for later restoration and its write protect bit is disabled.</li>
  <li>Calls a target function/program.</li>
  <li>Restores memory by iterating over the array, copying out the previously saved “pristine” content.</li>
  <li>GOTO 3</li>
</ol>

<blockquote>
  <p>I’m glossing over some details here (e.g. having to switch stacks when entering the snapshot/restore code so that the program doesn’t fault on its own stack and hang), but that’s the high level overview.</p>
</blockquote>

<h2 id="benefits">Benefits</h2>

<p>This approach has many nice benefits; perhaps the most significant is one I didn’t realize until later.
Since the dirty page list is kept between runs and the write-protect bit is cleared after the first write to a page, the overhead of intercepting memory writes goes down as more iterations run: commonly written pages are already in the restore list and aren’t hooked in subsequent runs.
This means fewer kernel context switches and more time spent actually running the target code.</p>

<p>After a few days of hacking on this, I got it working, targeting a simple program which did a <code class="highlighter-rouge">malloc</code> and printed out the returned address to show that heap restoration worked.
It also benchmarked quite nicely, with restoration taking under 2 microseconds - encouraging me on.</p>

<p>This proof-of-concept code is available <a href="https://github.com/kallsyms/uffd-fuzz">here</a>.</p>

<h1 id="a-real-benchmark">A Real Benchmark</h1>

<p>As seems to be fuzzing tradition, I decided to ensure this worked on “real” programs by wrapping <code class="highlighter-rouge">libjpeg-turbo</code>. Specifically, I targeted <code class="highlighter-rouge">djpeg</code> converting <a href="https://upload.wikimedia.org/wikipedia/commons/5/56/Tux.jpg">an image of Tux</a> into decompressed form and printing out the output.</p>

<p>In addition to the <code class="highlighter-rouge">userfaultfd</code> proof of concept, I also wrote up samples of a few other common fuzzing setups to compare against:</p>

<ol>
  <li>A simple fork server which just calls <code class="highlighter-rouge">fork</code> in a loop and then exec’s the target</li>
  <li>The same as above, but using <a href="https://man7.org/linux/man-pages/man2/vfork.2.html"><code class="highlighter-rouge">vfork</code></a></li>
  <li>An “improved” fork server which is inline in the target program, allowing initialization to happen once and forking just before <code class="highlighter-rouge">main</code> is called - you may know this as persistent mode in AFL</li>
</ol>

<p>Code for the <code class="highlighter-rouge">userfaultfd</code> and persistent mode versions is available in the branches of <a href="https://github.com/kallsyms/uffd-fuzz-libjpeg-turbo">this repo</a>.</p>

<h2 id="results">Results</h2>

<p>For 10,000 iterations of <code class="highlighter-rouge">djpeg /tmp/tux.jpg</code>:</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Median (ns)</th>
      <th>Min (ns)</th>
      <th>Max (ns)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>fork</td>
      <td>475737</td>
      <td>457088</td>
      <td>898557</td>
    </tr>
    <tr>
      <td>vfork</td>
      <td>442299</td>
      <td>427317</td>
      <td>815917</td>
    </tr>
    <tr>
      <td>persistent</td>
      <td>321325</td>
      <td>311980</td>
      <td>1610773</td>
    </tr>
    <tr>
      <td><strong>userfaultfd</strong></td>
      <td><strong>174630</strong></td>
      <td><strong>166653</strong></td>
      <td><strong>734212</strong></td>
    </tr>
  </tbody>
</table>

<p>As these results show, even on a more complex program the <code class="highlighter-rouge">userfaultfd</code> technique resulted in a ~1.8x median performance increase over persistent mode, validating the idea!</p>

<h1 id="limitations">Limitations</h1>
<p>As with everything there are some limitations to the proof of concept:</p>

<p>Most notably, the write-protect <code class="highlighter-rouge">userfaultfd</code> mode is currently implemented only for x86_64, meaning this approach will not work at all on ARM systems.
I don’t believe there’s any technical reason for this, however, so support could be added in the future.
Additionally, this technique <em>could</em> be implemented by clearing <code class="highlighter-rouge">PROT_WRITE</code> on every page and installing a normal SIGSEGV handler; however, this is notably slower (and much more annoying to implement) than <code class="highlighter-rouge">userfaultfd</code>.</p>

<p>Second, the fuzzing framework would need to intercept <code class="highlighter-rouge">mmap</code> (or really any syscall which could alter memory mappings) and “do the right thing.”
<code class="highlighter-rouge">mprotect</code> also needs to be hooked, and any pages being marked <code class="highlighter-rouge">PROT_WRITE</code> would need to be added to the <code class="highlighter-rouge">userfaultfd</code> before the <code class="highlighter-rouge">mprotect</code> returns.
None of this is done in the proof of concept, but it wouldn’t be too hard to add.</p>

<h1 id="conclusion">Conclusion</h1>

<p>While there is still more to be done to create a full implementation, this proof of concept shows that the strategy of using <code class="highlighter-rouge">userfaultfd</code> to reset program memory is viable, and even works as-is on moderately complex software.
Being fully in userland, it should be possible to adopt this technique into source-available fuzzers like AFL(++) with relatively little maintenance work (compared to custom kernel modifications).
Unfortunately I don’t have the time to do that myself, but hopefully someone does!</p>

<p>As always, feel free to reach out with any questions, suggestions, or if you happen to implement this technique in a real fuzzer :)</p>]]></content><author><name>Nick Gregory</name></author><category term="security" /><summary type="html"><![CDATA[About the same time I wrote up my previous post about snapshot fuzzing, I was thinking about other ways to restore program state for fuzzing, ideally in userland for ease of use.]]></summary></entry><entry><title type="html">diffusion.gallery - A Constantly Changing Machine Generated Art Gallery</title><link href="https://www.nickgregory.me/post/2022/09/10/diffusion-gallery/" rel="alternate" type="text/html" title="diffusion.gallery - A Constantly Changing Machine Generated Art Gallery" /><published>2022-09-10T00:00:00+00:00</published><updated>2022-09-10T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2022/09/10/diffusion-gallery</id><content type="html" xml:base="https://www.nickgregory.me/post/2022/09/10/diffusion-gallery/"><![CDATA[<p>tl;dr: <a href="https://www.diffusion.gallery/">diffusion.gallery</a> is a website I put together which feeds random prompts from OpenAI into <a href="https://stability.ai/blog/stable-diffusion-public-release">Stable Diffusion</a>. It’s pretty neat.</p>

<h1 id="stable-diffusion">Stable Diffusion</h1>
<p>The past few weeks have been an exciting time for ML/AI (at least to an outside observer like myself).
There’s been a staggering number of innovations and experiments around <a href="https://stability.ai/blog/stable-diffusion-public-release">Stable Diffusion</a>, a new model which can synthesize images from text (similar to <a href="https://openai.com/blog/dall-e/">DALL-E</a>) or even from other source images.
You can use it to <a href="https://thishousedoesnotexist.org/">generate houses</a>, <a href="https://www.reddit.com/r/StableDiffusion/comments/wsnlh4/how_to_draw_an_owl/">“draw the rest of the owl”</a>, or create some <a href="https://www.reddit.com/r/StableDiffusion/comments/x3u3r0/jeflon_zuckergates/">really cursed images</a> just to give a few examples.</p>

<h1 id="trying-it-out">Trying It Out</h1>
<p>I saw on Twitter that someone had <a href="https://github.com/bfirsh/stable-diffusion/tree/apple-silicon-mps-support">added support for Apple Silicon</a> to the image generation scripts, and since I daily-drive an M1 MacBook Pro, that made it easy enough for me to try out.
One of the first things I asked it to generate was some concept art for a place called the <a href="https://www.destinypedia.com/Dreaming_City#Gallery">“Dreaming City”</a> from one of my favorite games, Destiny 2.
The results were really impressive, especially considering that I hadn’t done any prompt or parameter tuning:</p>

<p><img src="/images/stable-diffusion/grid-0021.png" alt="Dreaming City, Destiny 2" />
<img src="/images/stable-diffusion/grid-0017.png" alt="Dreaming City, Destiny 2" />
<img src="/images/stable-diffusion/grid-0019.png" alt="Dreaming City, Destiny 2" /></p>

<h1 id="the-idea">The Idea</h1>
<p>After messing around a bit more and seeing some of the things Stable Diffusion was able to produce, I thought it would be neat to have it constantly generating new art pieces and have them shown in a framed display on my wall, resembling art in a gallery.
The first idea I had was for it to randomly pick from a handful of subjects, environments, and styles and ask the model to generate that. However, after mentioning it to a <a href="https://twitter.com/KinglyCrow2">friend</a>, he suggested I go “full AI” and have an OpenAI model generate the prompt, which Stable Diffusion then turns into an image.</p>

<p>After a few hours of tinkering, the result is <a href="https://www.diffusion.gallery">diffusion.gallery</a>.</p>

<h1 id="diffusiongallery">diffusion.gallery</h1>

<p>Every 5 minutes a new prompt and image is generated and uploaded to the gallery. The page will automatically refresh so you can leave it up all day if you want - every time you switch to it, odds are there will be a brand new piece.</p>

<p>The bottom right shows a description card for the piece including its “author” (the model that generated the image), its “title” (the timestamp at which it was generated), and the prompt passed to Stable Diffusion which generated the image.</p>

<p>N.B. The images are created at a 16:9 aspect ratio (1024x576) for ideal viewing on normal widescreen monitors.</p>

<blockquote>
  <p>For those wondering: the odd resolution is due to the fact that the dimensions must be divisible by 64, and this is the only 16:9 resolution (below 1080p) for which this is true.</p>
</blockquote>

<h3 id="disclaimer">Disclaimer</h3>

<p>To try to avoid generating anything NSFW, the prompt to OpenAI explicitly requests that the resulting prompt (from which the image is generated) not focus on any specific people.
Combined with the safety classifier built into Stable Diffusion, I don’t <em>think</em> anything generated will be offensive, but it’s obviously still possible that something bad comes out.
Use the site at your own risk.</p>

<h1 id="some-pieces">Some Pieces</h1>
<p>After letting it run for just a few hours, I saw some really interesting images go by, ranging from:</p>

<p>Hyper-realistic pictures</p>

<p><a href="https://www.diffusion.gallery/#1662508500"><img src="/images/stable-diffusion/1662508500.png" alt="A mostly barren landscape with a few jagged peaks in the distance. The sky is a harsh, unforgiving blue, and the air is cold and dry." /></a>
<em>Prompt: A mostly barren landscape with a few jagged peaks in the distance. The sky is a harsh, unforgiving blue, and the air is cold and dry.</em></p>

<p><br /></p>

<p>To dystopic drawings</p>

<p><a href="https://www.diffusion.gallery/#1662526500"><img src="/images/stable-diffusion/1662526500.png" alt="This painting is of an abandoned building in the middle of a dark desert. The subject is an old, crumbling building surrounded by nothing but empty, scorched earth. The painting is dark and moody, with a erie, atmospheric feel to it." /></a>
<em>Prompt: This painting is of an abandoned building in the middle of a dark desert. The subject is an old, crumbling building surrounded by nothing but empty, scorched earth. The painting is dark and moody, with a erie, atmospheric feel to it.</em></p>

<p><br /></p>

<p>To impressionist paintings</p>

<p><a href="https://www.diffusion.gallery/#1662519300"><img src="/images/stable-diffusion/1662519300.png" alt="This painting depicts a bustling city street full of people and their belongings. The scene is brightly lit and colorful, and the buildings in the background are sharply silhouetted against the sky. The painting is undoubtedly impressionistic, with a loose, free style that allows the various elements to share the spotlight." /></a>
<em>Prompt: This painting depicts a bustling city street full of people and their belongings. The scene is brightly lit and colorful, and the buildings in the background are sharply silhouetted against the sky. The painting is undoubtedly impressionistic, with a loose, free style that allows the various elements to share the spotlight.</em></p>

<p><br /></p>

<p>To the abstract</p>

<p><a href="https://www.diffusion.gallery/#1662520800"><img src="/images/stable-diffusion/1662520800.png" alt="This painting is of an abstract landscape with bright blues, greens, and oranges. It has a feeling of dynamism and energy, as if it is constantly moving. The style is impressionistic, with soft brush strokes and strong highlights." /></a>
<em>Prompt: This painting is of an abstract landscape with bright blues, greens, and oranges. It has a feeling of dynamism and energy, as if it is constantly moving. The style is impressionistic, with soft brush strokes and strong highlights.</em></p>

<p><br /></p>

<p>And finally, the best example of AI: a prompt confidently describing a non-existent painting with a full background story and artist</p>

<p><a href="https://www.diffusion.gallery/#1662507600"><img src="/images/stable-diffusion/1662507600.png" alt="Salmon Run is an iconic painting by American painter Robert Henri. It depicts a panoramic view of the Columbia River, with salmon leaping upriver to spawn. The painting was commissioned by the Oregon Railroad and Navigation Company in 1914 as a celebration of the railroad's 50th anniversary." /></a>
<em>Prompt: Salmon Run is an iconic painting by American painter Robert Henri. It depicts a panoramic view of the Columbia River, with salmon leaping upriver to spawn. The painting was commissioned by the Oregon Railroad and Navigation Company in 1914 as a celebration of the railroad’s 50th anniversary.</em></p>

<h1 id="framing-it">Framing It</h1>

<p>To close out the project, I picked up a <a href="https://www.amazon.com/dp/B09L12DGW5">thin 15.6” OLED display</a> and a cheap 11x17 frame. Some taping and wire management later, the gallery was now up on the wall.</p>

<p><img src="/images/stable-diffusion/framed.jpg" alt="Framed display showing diffusion.gallery" /></p>]]></content><author><name>Nick Gregory</name></author><category term="ai" /><summary type="html"><![CDATA[tl;dr: diffusion.gallery is a website I put together which feeds random prompts from OpenAI into Stable Diffusion. It’s pretty neat.]]></summary></entry><entry><title type="html">Using Graphs to Search for Code</title><link href="https://www.nickgregory.me/post/2022/07/02/go-code-as-a-graph/" rel="alternate" type="text/html" title="Using Graphs to Search for Code" /><published>2022-07-02T00:00:00+00:00</published><updated>2022-07-02T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2022/07/02/go-code-as-a-graph</id><content type="html" xml:base="https://www.nickgregory.me/post/2022/07/02/go-code-as-a-graph/"><![CDATA[<p>Some time ago, I was working on a server to generate images from weather RADAR data (a separate post on this will come at some point).
As part of this, I spent a few hours profiling my code and found a tiny “bug” in the open source library I was using to parse one type of RADAR data.</p>

<p>To summarize, the library was doing</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data := make([]uint8, ldm)
binary.Read(r, binary.BigEndian, &amp;data)
</code></pre></div></div>

<p>when an equivalent but much faster way of doing this is</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data := make([]uint8, ldm)
binary.Read(r, binary.BigEndian, data)
</code></pre></div></div>

<p>Notice the difference?</p>

<p>Passing <code class="highlighter-rouge">&amp;data</code> instead of just <code class="highlighter-rouge">data</code> caused <code class="highlighter-rouge">binary.Read</code> to take <em>nearly twice as long</em>, and the function this was in was responsible for the vast majority of the request runtime. That one character decreased throughput by nearly 40%!</p>

<blockquote>
  <p>Aside: you may be wondering, why is passing a pointer to the array so much slower?
It’s because Go’s binary.Read has a <a href="https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/encoding/binary/binary.go;l=192">fast-path for array types</a>, but a pointer to an array ends up taking a much slower <code class="highlighter-rouge">reflect</code> based path.</p>
</blockquote>

<p>Finding this got me thinking: this seems like something that could very easily slip into other code. Is there any existing way to <em>quickly</em> search over code when you find a bug or “anti-pattern” like this?</p>

<h2 id="existing-tools">Existing Tools</h2>

<p>There are three broad categories that existing tools seem to fall into:</p>

<ol>
  <li>Highly advanced program analysis toolkits which build per-project databases (potentially compiling them in the process):
    <ul>
      <li><a href="https://codeql.github.com/">CodeQL</a></li>
      <li><a href="https://joern.io/">Joern</a></li>
    </ul>
  </li>
  <li>AST-based matching tools
    <ul>
      <li><a href="https://semgrep.dev/">Semgrep</a></li>
      <li><a href="https://github.com/googleprojectzero/weggli">Weggli</a></li>
    </ul>
  </li>
  <li>“Simple” grep-like tools
    <ul>
      <li><a href="https://sourcegraph.com/">Sourcegraph</a></li>
      <li><code class="highlighter-rouge">grep</code></li>
    </ul>
  </li>
</ol>

<p>None of these address everything that’s needed to do what I want though.
The program analysis toolkits by their nature don’t allow for easy querying for “all uses of symbol X across all repos” without consuming a <em>ton</em> of compute resources running the query on each project independently.
Similarly, the simple AST tools <em>may</em> have enough information to run the query (if they do basic type deduction), but as far as I know, are all built as “interactive” tools that once again only run on one repo at a time and don’t index anything.
Lastly, the purely textual tools can have indexes built, but they don’t have the type information required to check for the “argument is a pointer to an array” part of our query.</p>

<p>I wanted something that has the intelligence of the first group (type information, data flow, import resolution, etc.) with the scalability of the third (tens of thousands of projects queryable in seconds).</p>

<p>So I wrote it! Introducing <a href="https://github.com/kallsyms/go-graph">go-graph</a>, because I’m bad at naming things.</p>

<h1 id="goals">Goals</h1>
<h2 id="overview">Overview</h2>

<p>For now, this project is only concerned with Go code. Why?</p>

<ul>
  <li>Go has a large open source community, so there’s plenty of code to search</li>
  <li>Go is a simple language (for good or for bad), which makes our AST parsing easy
    <ul>
      <li>Go also ships with all of the libraries we need to parse and analyze Go code because the compiler is self-hosted</li>
    </ul>
  </li>
  <li>Go’s lack of a preprocessor means that a specific file in a package can only ever be compiled one way - there’s no chance for an <code class="highlighter-rouge">#ifdef</code> or similar to change the file</li>
  <li>Lastly, but perhaps most importantly: the motivating bug was in Go code and I wanted to keep the scope of this project relatively tight since there’s basically no upper bound on how complex it could get</li>
</ul>

<p>Now with that established, let’s talk about how it works.</p>

<h2 id="implementation">Implementation</h2>

<p>First, the schema. go-graph indexes:</p>

<ul>
  <li>Metadata (source URL, and version) about each indexed Go package</li>
  <li>The functions in each package</li>
  <li>The statements in each function, their raw source text, and their successor(s) (i.e. the control flow graph)</li>
  <li>Any function calls which happened in a statement, with the resolved target(s) of the call</li>
  <li>Variables (name and type), as well as relations from each variable to where it’s defined and to all statements in which it’s referenced</li>
</ul>

<p>This is more than enough to be able to run the original motivating query, and even allows for some more complex queries like searching for <a href="https://en.wikipedia.org/wiki/Program_slicing">program slices</a> that match some criteria (as we’ll see later).</p>
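<p>As a rough illustration, the vertex documents above could be mirrored in Go like so. The field names match those used in the AQL queries later in the post, but this is a sketch for intuition, not go-graph's actual schema definitions:</p>

```go
package main

import "fmt"

// Illustrative mirror of go-graph's vertex documents (hypothetical names).
type Package struct {
	SourceURL string // e.g. "encoding/binary" or a repo URL
	Version   string
}

type Function struct {
	Name string
}

type Statement struct {
	File string
	Text string // raw source text of the statement
}

type Variable struct {
	Name string
	Type string // e.g. "[]uint8"
}

// Edges (Functions, Callee, CallSiteStatement, References, Assigns, ...)
// connect these documents: package -> function -> call site -> statement
// -> variable.

func main() {
	v := Variable{Name: "data", Type: "[]uint8"}
	fmt.Println(v.Name, v.Type)
}
```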

<h3 id="storage">Storage</h3>

<p>Next, I had to decide how to store all of this data. A graph database seemed natural given the, well, graph structure of all of this information.</p>

<blockquote>
  <p>You could say I’m building a source-graph, but that name was already taken :)</p>
</blockquote>

<p>Additionally, graph query languages are specifically built to let us easily write “multi-hop” (or even fully recursive) queries, which is a necessity when moving between functions, their call sites, the statements for those call sites, the variables referenced in those statements, etc.
This <em>could</em> all be done in a relational DB, but a graph DB is a much more natural starting point.</p>

<p>So which graph database to use?</p>

<p>I had been wanting to try out JanusGraph for a while, so I did an initial implementation using it, mainly out of curiosity. It did work; however, indexing was somewhat bottlenecked (topping out at ~80k vertices or edges per second, and dipping as low as 20k/s), and the Go libraries for working with Gremlin queries (the query language JanusGraph uses) are not amazing.</p>

<p>As part of a series of optimizations I did after initial development, I was looking at Neo4j (as it’s more or less the industry standard as far as I know), but ran across a <a href="https://community.neo4j.com/t5/neo4j-graph-platform/tuning-for-larger-than-memory-multiple-tb-graph-node-insertion/td-p/30368">forum post</a> by <a href="https://unhexium.net/research/neo4j-performance-adventures-for-petabyte-scale-datasets/">Ben Klein</a>, who was having issues doing large-scale bulk inserts.</p>

<p>I mention this in particular because it just so happens the project that forum post talks about using Neo4j for is <a href="https://github.com/utk-se/WorldSyntaxTree">WorldSyntaxTree</a>. The goal of that project is, in short, representing repositories, files, and parsed <a href="https://github.com/tree-sitter/tree-sitter">tree-sitter</a> ASTs in a graph database - similar enough to what I’m doing that I figured if they were having issues with Neo4j, I probably would as well. Luckily for me, they had already tested and decided on another database, <a href="https://www.arangodb.com/">ArangoDB</a>, so I followed in their footsteps.</p>

<h1 id="results">Results</h1>
<h2 id="indexing">Indexing</h2>
<p>After hacking on this on and off for a few weeks then coming back a few months later to do the aforementioned performance optimizations, I had a working version which did everything I needed.</p>

<p>I shallow cloned all Go repos on GitHub with &gt;=100 stars (11,659 of them) giving me about 330GB of source, and started indexing.
Before the optimization work, this took about 1.5 weeks to ingest, but as of now, all 11k repos can be indexed in about 11 hours, resulting in an ArangoDB data directory of just under 200GB. For reference, the Go indexing code was running on an 8 core Xeon with 32GB of RAM, and the database was running in a VM on a machine with a 10 core i9, also with 32GB of RAM.</p>

<p>Some fun statistics from the ingest process:</p>
<ul>
  <li>Network traffic to the database averaged about 50Mbps, peaking at nearly 200Mbps</li>
  <li>After all was done, the database had:
    <ul>
      <li>219k distinct packages, 420k unique (package,version) pairs</li>
      <li>27M functions</li>
      <li>450M function calls</li>
      <li>190M statements</li>
      <li>126M variables</li>
      <li>91M variable assigns</li>
      <li>360M variable references</li>
      <li>185M statement “next” edges</li>
    </ul>
  </li>
</ul>

<h2 id="other-tools">Other Tools</h2>
<p>I didn’t bother spending the time building CodeQL or Joern databases for each project as I’m almost certain that would have taken longer than my indexing did and I know the query times definitely wouldn’t satisfy the seconds-to-minutes requirement. Just starting/initializing either system takes multiple seconds, times tens of thousands of repos puts any queries well into the tens of minutes to hours range.</p>

<p>With those out, the only other tool that had all of the information needed to execute the original query was Semgrep.
I let it rip over all of the code with this rule:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rules:
- id: untitled_rule
  pattern: |
      binary.Read(..., &amp;($X : []$Y))
  message: Semgrep found a match
  languages: [go]
  severity: WARNING
</code></pre></div></div>

<p>aaaandd it died:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time semgrep --metrics=off -c /tmp/semgrep_rule.yaml --verbose
...
====[ BEGIN error trace ]====
Raised at Stdlib__map.Make.find in file "map.ml", line 137, characters 10-25
Called from Sexplib0__Sexp_conv.Exn_converter.find_auto in file "src/sexp_conv.ml", line 156, characters 10-37
=====[ END error trace ]=====
...
3374.64s user 505.06s system 143% cpu 44:59.79 total
</code></pre></div></div>

<p>Running semgrep for each repo individually <em>did</em> work however, and it yielded a run time of about 2h15m:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time for d in *; do semgrep --metrics=off -c /tmp/semgrep_rule.yaml $d; done
...
8210.36s user 1246.15s system 115% cpu 2:16:59.63 total
</code></pre></div></div>

<p>It looks like some of semgrep’s initialization is single-threaded, so giving it the benefit of the doubt, we could expect ~13 minute query times if I had run 10 instances of this in parallel.</p>

<h2 id="go-graph">go-graph</h2>

<p>I used the following Arango query to perform the search:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FOR p IN package
FILTER p.SourceURL == "encoding/binary"
FOR f IN OUTBOUND p Functions
FILTER f.Name == "Read"
FOR callsite IN INBOUND f Callee
FOR statement IN OUTBOUND callsite CallSiteStatement
FOR var IN OUTBOUND statement References
FILTER STARTS_WITH(var.Type, "[]")
FILTER CONTAINS(statement.Text, CONCAT("&amp;", var.Name))
FOR callfunc in INBOUND statement Statement
FOR callpkg in INBOUND callfunc Functions
RETURN {package: callpkg.SourceURL, file: statement.File, text: statement.Text, var: var.Name, type: var.Type}
</code></pre></div></div>

<p>Writing this out in English:</p>
<ol>
  <li>Take the <code class="highlighter-rouge">encoding/binary</code> package</li>
  <li>Traverse out to Functions named <code class="highlighter-rouge">Read</code></li>
  <li>Traverse to the call sites of <code class="highlighter-rouge">Read</code>, and out to the statement in which the call occurred</li>
  <li>Traverse out to the variable referenced within that statement</li>
  <li>Filter for variables whose type starts with <code class="highlighter-rouge">[]</code> and which appear after a <code class="highlighter-rouge">&amp;</code> in the raw statement text</li>
  <li>Traverse out from the call statement to the containing function, and then to the package containing said function</li>
  <li>Return the package, file, and statement in which the call happened, and the variable name and type which was passed incorrectly</li>
</ol>

<p>And the performance?</p>

<p>…drumroll…</p>

<p>This query ran in only ~20s! Exactly in the timeframe I was looking for. Not quite “interactive” but also quick enough that you can iterate on queries without losing your train of thought.</p>

<h1 id="another-one">Another one</h1>

<p>One other potential use case that has interested me is using go-graph to do “poor man’s data flow analysis”, mainly to find examples of how you can go from one function/data type to another.</p>

<p>This is a somewhat contrived example, but let’s find all uses of <code class="highlighter-rouge">crypto/rsa.GenerateKey</code> where the result flows through 0 or more intermediary variables to be used in a <code class="highlighter-rouge">pem.Encode</code> call:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// find calls to crypto/rsa.GenerateKey
FOR p IN package
FILTER p.SourceURL == "crypto/rsa"
FOR f IN OUTBOUND p Functions
FILTER f.Name == "GenerateKey"
FOR call IN INBOUND f Callee
FOR srccallstmt IN OUTBOUND call CallSiteStatement

// Walk assign-&gt;ref-&gt;assign-&gt;ref-&gt;...
// until we reach a statement with an interesting call.
// v "alternates" between being a variable and being a statement

FOR v, e, path IN 1..5 OUTBOUND srccallstmt Assigns, INBOUND References
    PRUNE CONTAINS(v.Text, "Encode")
    OPTIONS {uniqueVertices: "path"}

// ensure that the end vertex is where we want
// quick check before doing any traversals
FILTER CONTAINS(v.Text, "Encode")
// now walk to the call site, called func,
// and ensure it's actually encoding/pem.Encode
FOR dstcallstmt IN INBOUND v CallSiteStatement
FOR dstcallfunc IN OUTBOUND dstcallstmt Callee
FILTER dstcallfunc.Name == "Encode"
FOR dstcallpkg IN INBOUND dstcallfunc Functions
FILTER dstcallpkg.SourceURL == "encoding/pem"

// ensure the "reference" is not actually an assignment
// go-graph considers a variable to be referenced
// even if it's on the left-hand side of an assignment
// which means `x, err := GenerateKey(); y, err := bar; Encode(y)`
// would match without this last filter since `err` is assigned
// in the first statement then also considered as "referenced"
// in the second
FILTER LENGTH(
    FOR stmt IN path.vertices
    FILTER IS_SAME_COLLECTION(statement, stmt)
    FOR checkassign IN OUTBOUND stmt Assigns
    FOR target IN path.vertices
    FILTER IS_SAME_COLLECTION(variable, target)
    FILTER POSITION(path.vertices, target, true) &lt; POSITION(path.vertices, stmt, true)
    FILTER target == checkassign
    RETURN checkassign
) == 0
RETURN path
</code></pre></div></div>

<p>This query might be more complex than you’d expect, but despite that it works in a relatively timely fashion. 1000 results takes 2 seconds for up to 1 intermediate variable, 8 seconds for up to 2 intermediates, and 30 seconds for up to 3. For example, it returned <a href="https://github.com/go-gitea/gitea/blob/main/cmd/cert.go#L102">this code in Gitea</a> (where I originally got the idea to test from <code class="highlighter-rouge">crypto/rsa.GenerateKey</code> to <code class="highlighter-rouge">pem.Encode</code>):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>priv, err = rsa.GenerateKey(rand.Reader, c.Int("rsa-bits"))
...
derBytes, err := x509.CreateCertificate(rand.Reader, &amp;template, &amp;template, publicKey(priv), priv)
...
err = pem.Encode(certOut, &amp;pem.Block{Type: "CERTIFICATE", Bytes: derBytes})
</code></pre></div></div>

<p>But it also found a longer chain in a <a href="https://github.com/kubernetes/client-go/blob/master/util/cert/cert.go#L115">kubernetes cert utility</a>:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>caKey, err := rsa.GenerateKey(cryptorand.Reader, 2048)
...
caDERBytes, err := x509.CreateCertificate(cryptorand.Reader, &amp;caTemplate, &amp;caTemplate, &amp;caKey.PublicKey, caKey)
...
caCertificate, err := x509.ParseCertificate(caDERBytes)
...
derBytes, err := x509.CreateCertificate(cryptorand.Reader, &amp;template, caCertificate, &amp;priv.PublicKey, caKey)
...
err := pem.Encode(&amp;certBuffer, &amp;pem.Block{Type: CertificateBlockType, Bytes: derBytes})
</code></pre></div></div>

<p>This ability could help you when you have a data source (e.g. <code class="highlighter-rouge">GenerateKey</code>) and know where the data needs to end up (e.g. written to a PEM file), but can’t find any examples of what function(s) are needed to convert between (<code class="highlighter-rouge">x509.CreateCertificate</code> in this case).</p>

<h1 id="improvements">Improvements</h1>

<p>go-graph currently doesn’t keep a graph of the full AST, so there’s no “pure graph” way to find when a reference to a variable is taken. It wouldn’t be too hard to add, but since I knew the trick of looking for <code class="highlighter-rouge">&amp;{variableName}</code> would work, I didn’t bother implementing it for now. Similarly, the position/context of a variable reference is not saved anywhere, so (barring more string comparisons) you’re unable to specify something like “&amp;x is the 3rd argument”.</p>

<p>There are also “semantically equivalent” variants of the first query which aren’t captured. For instance:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data := make([]uint8, ldm)
data2 := &amp;data
binary.Read(r, binary.BigEndian, data2)
</code></pre></div></div>

<p>For the original motivating query though, it doesn’t really matter, as the mistake is going to be adding a <code class="highlighter-rouge">&amp;</code> into the <code class="highlighter-rouge">binary.Read</code> call, not passing a pre-existing variable of type <code class="highlighter-rouge">*[]whatever</code>. It’s also entirely possible to change the query to look for that case of course if that is desired.</p>

<h1 id="closing-words">Closing Words</h1>

<p>This level of search definitely isn’t for every use case, but it does fit nicely into a slot that I haven’t seen any other project fill.
There’s a <em>lot</em> of details in the implementation not covered here, but they’re not really relevant to the overarching goal.
I encourage you to go <a href="https://github.com/kallsyms/go-graph">check out the code</a> if you’re interested in the nitty gritty.
I’m also happy to provide tarballs of the cloned repos and/or indexed database if anyone would like them to experiment on without having to clone and index everything themselves.</p>

<p>I hope you enjoyed reading! As always, feel free to drop me an email with any questions, suggestions, or other ideas.</p>]]></content><author><name>Nick Gregory</name></author><category term="programming" /><category term="devtools" /><summary type="html"><![CDATA[Some time ago, I was working on a server to generate images from weather RADAR data (a separate post on this will come at some point). As part of this, I spent a few hours profiling my code and found a tiny “bug” in the open source library I was using to parse one type of RADAR data.]]></summary></entry><entry><title type="html">Seeing the Clouds with the Cloud</title><link href="https://www.nickgregory.me/post/2022/06/03/azure-orbital/" rel="alternate" type="text/html" title="Seeing the Clouds with the Cloud" /><published>2022-06-03T00:00:00+00:00</published><updated>2022-06-03T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2022/06/03/azure-orbital</id><content type="html" xml:base="https://www.nickgregory.me/post/2022/06/03/azure-orbital/"><![CDATA[<p>If you follow AWS closely, you may have heard about a niche product launch a few years back called <a href="https://aws.amazon.com/ground-station/">Ground Station</a> which lets you rent, well, a ground station (basically a big antenna plus supporting equipment to communicate with satellites).
A friend recently linked me an <a href="https://aws.amazon.com/blogs/publicsector/earth-observation-using-aws-ground-station/">AWS blog post</a> with a sample use case which described using it as a way of receiving real time imagery from orbiting weather satellites.
Now funny enough, receiving data from polar orbiting weather satellites has been a side project of mine for over a decade now, but living in NYC has put a bit of a hold on it. I used to have a <a href="https://www.instructables.com/NOAA-Satellite-Signals-with-a-PVC-QFH-Antenna-and-/">home-built QFH antenna</a> which I used to receive images with a surprisingly high success rate given the janky construction of it.</p>

<p><img src="/images/azure-orbital/qfh_antenna.jpg" alt="The antenna" style="width: 40%" /></p>

<p>Yes, you’re seeing that correctly - it’s an antenna made of PVC tubing and coax duct-taped to the top of a pole for a basketball hoop. Crude but effective.</p>

<p>Anyways, the ability to use a remote antenna to downlink imagery piqued my interest, especially since these antennas would let me get the highest quality digital imagery sent out in the 8GHz <a href="https://en.wikipedia.org/wiki/X_band">X-band</a> instead of the lower-quality analog <a href="https://www.sigidwiki.com/wiki/Automatic_Picture_Transmission_(APT)">APT</a> transmissions around 137MHz that I had received in the past. So I set out to try and downlink a “true color” image.</p>

<p>I requested access to AWS ground station, but also found out about and filled out a request form to get access to <a href="https://azure.microsoft.com/en-us/services/orbital/">Azure Orbital</a> - Microsoft’s competing offering which is still in preview.</p>

<h1 id="onboarding">Onboarding</h1>

<p>I never ended up hearing back from AWS after an initial email from them requesting details about my use case, however this is probably for the best as it costs $10/min to rent one of their antennas. With one pass of a polar orbiting satellite lasting anywhere from ~8-15 minutes, this would have gotten <em>really</em> expensive to be playing around on.</p>

<p>Since Azure Orbital was still in preview though, it was free to use! The Orbital team onboarded me to the preview quickly; however, after a bit of back and forth trying to figure out why I was getting an error when a “contact” was supposed to start, I found out that they were only allowing downlinking from NASA’s Aqua satellite, not the weather-specific polar orbiting satellites (e.g. NOAA-20).
This was fine though, as at the end of the day this was just an experiment and I had no need for the weather satellites in particular.</p>

<h1 id="trial-and-error">Trial and Error</h1>

<p>While Azure was great about getting me on the platform, their docs were… lacking to say the least.
It looks like they’ve added a small <a href="https://docs.microsoft.com/en-us/azure/orbital/howto-downlink-aqua">how-to guide</a> in the months since I was experimenting which explains some of the questions I had, however it still doesn’t cover the last phase of actually demodulating and decoding the signal into usable data.
It’s understandable since that’s not “relevant” to the service, but what good is it to receive data without doing something with it!</p>

<p>In case it helps anyone else though, I’ve put the questions I had and the answers the Orbital team gave back to me <a href="#appendix-orbital-qa">down below</a>.</p>

<h1 id="data-ingest">Data Ingest</h1>

<h2 id="preface">Preface</h2>
<p>Before diving in to details, I want to quickly go over the process for transforming radio signals into data, at least as it applies to receiving data from AQUA using Azure Orbital.</p>

<ol>
  <li>RF data is received by an antenna, digitized, and transmitted over the network as a series of <a href="https://www.tek.com/en/blog/quadrature-iq-signals-explained"><strong>I/Q data</strong></a> encapsulated in “VITA-49” packets. These packets include the raw data as well as a bunch of other metadata from the receiving system (things like timestamps, receiver gain(s), configured intermediate frequency, etc.).</li>
  <li>The I/Q data is <strong>demodulated</strong>, in our case as a <a href="https://www.allaboutcircuits.com/technical-articles/quadrature-phase-shift-keying-qpsk-modulation/">QPSK</a> signal. This transforms the RF stream into one of four possible <em>symbols</em>.</li>
  <li>The demodulated <em>symbols</em> are then <strong>decoded</strong> into complete data <em>frames</em>. This is where the data is first interpreted (synchronizing to the start of frames).</li>
  <li>The <em>frames</em> are then checked for errors, the headers are parsed, and the frames are separated by the “virtual channel” (so data from multiple instruments can all share a common downlink and even be interleaved) then dispatched for processing.</li>
</ol>

<p>There’s a lot to unpack in that if you’re new to this, but hopefully it helps make the rest of the blog at least a bit more understandable!</p>
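<p>As a toy illustration of step 2, a hard-decision QPSK “slicer” maps each I/Q sample to one of four symbols based purely on the signs of its components. To be clear, this is just a sketch to build intuition — the real downlink’s symbol mapping, and the carrier/clock recovery that has to run before a decision like this is even possible, are considerably more involved:</p>

```go
package main

import "fmt"

// sliceQPSK makes a hard symbol decision for one I/Q sample: the signs of
// the in-phase (real) and quadrature (imaginary) components pick one of the
// four constellation quadrants. The 2-bit mapping here is illustrative, not
// necessarily the one the actual downlink uses.
func sliceQPSK(iq complex128) byte {
	var sym byte
	if real(iq) < 0 {
		sym |= 0b10
	}
	if imag(iq) < 0 {
		sym |= 0b01
	}
	return sym
}

func main() {
	// One clean sample per quadrant.
	for _, s := range []complex128{1 + 1i, -1 + 1i, -1 - 1i, 1 - 1i} {
		fmt.Printf("%v -> %02b\n", s, sliceQPSK(s))
	}
}
```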

<h2 id="receiving">Receiving</h2>
<p>As the Q&amp;A at the end mentions, Orbital <em>can</em> do the demodulation and/or decoding for us; however, the format for specifying how to do that is proprietary to a specific brand of modem (Kratos).
Googling around a bit didn’t show any public documentation and I didn’t really feel like contacting them to try and get access to docs, so we’ll have to do this ourselves in software from the raw RF data.</p>

<p>First off, I set up <code class="highlighter-rouge">socat</code> on a small Azure VM to receive the data from the Orbital service and dump it to a file. I could have set up the listener on one of my personal machines, however given the bandwidth required to receive (~300Mbps) and the fact that I was on the opposite coast of the U.S., I opted to use a VM local to the Azure region the receiver was in to stage the data first.</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ socat TCP-LISTEN:1234,reuseaddr,fork 'SYSTEM:cat &gt; raw-$(date +%s).dat'
</code></pre></div></div>

<p>After a pass, I manually uploaded the <code class="highlighter-rouge">.dat</code> files to object storage so that I could retrieve them later for processing on my machines at home.</p>

<p>Next, I needed to extract the raw data from the VITA-49 packets that Orbital actually sends. To do this, I wrote a Python script which parses the header of each packet (since the headers are variable length) and dumps the raw I/Q payload into a file since that’s all that matters for our purposes.
This took a bit longer than you might think because the actual specification for VITA-49 <a href="https://www.vita.com/Sys/Store/Products/258942">costs a hundred dollars</a>, and I hadn’t yet been pointed to the VITA-49 compatible (but free!) <a href="https://dificonsortium.org/">“Digital IF Interoperability Standard”</a>.</p>

<blockquote>
  <p>Editors note: while trying to find the price for the VITA-49 spec again I ran across a <a href="https://www.wirelessinnovation.org/assets/documents/sdrf-07-i-0006-v0-0-0%20vita%2049%20spec.pdf">draft of the spec</a> which seems to cover everything. I guess my Google-fu was off…</p>
</blockquote>

<p>The data extraction script can be found <a href="https://gist.github.com/kallsyms/c6e29bb72c190cd4b0edc5c1511bd3f9">here</a>.</p>

<h2 id="demodulation">Demodulation</h2>

<p>I started with a <a href="https://github.com/altillimity/X-Band-Decoders/blob/master/Flowcharts/AQUA%20Demodulator.grc">GNU Radio flowchart</a> from the <a href="https://github.com/altillimity/X-Band-Decoders">altillimity/X-Band-Decoders</a> GitHub repo to demodulate the signal.</p>

<h3 id="aside-gnu-radio">Aside: GNU Radio</h3>
<p>GNU Radio is an open source software-defined radio toolkit. It provides a ton of building <em>blocks</em> which can be chained together into <em>flowcharts</em> which implement any type of signal processing workload.
For example, the flowchart used to do the demodulation looks like this:</p>

<p><img src="/images/azure-orbital/flowchart.png" alt="GNU Radio flowchart" /></p>

<p>Don’t get me wrong, this looks very intimidating at first. I definitely still don’t understand all of it! However it’s worlds better than trying to piece together the code to do all of this signal processing yourself.</p>

<p> </p>

<p>This was probably the most finicky part of the entire process. When running any GNU Radio chart on my M1 Macbook Pro (GRC version 3.9.3.0), the graphs didn’t update at all, and it seemed like the entire thing froze on startup.
After spending way too long thinking that was a bug in the chart and trying everything I could think of to make it work, I eventually ran it on my x86 Linux laptop and the graphs were updating and it seemed to be doing <em>something</em>.</p>

<p>A few minutes into processing (once the satellite was overhead), the frequency plot looked good:</p>

<p><img src="/images/azure-orbital/spectrum.png" alt="Frequency spectrum (FFT) visualization" /></p>

<p>but the constellation plot had a weird “double image” and was showing eight clusters instead of the expected four (since this is QPSK). The demodulated data that was coming out was also not parsable by any tool I found - it seemed to be complete garbage.</p>

<p><img src="/images/azure-orbital/bad_constellation.png" alt="Bad constellation plot" /></p>

<p>Suspecting that this had something to do with clock recovery (matching the exact rate at which symbols are sampled to the rate the satellite is sending them out), some Googling around turned up <a href="https://www.tablix.org/~avian/blog/archives/2015/03/notes_on_m_m_clock_recovery/">a blog post</a> describing what the “Clock Recovery MM” block was actually doing under the hood. Applying its suggestions and tweaking the block parameters, I got slightly better output; however, it still wasn’t great. The decode tools were getting sync, but nearly every frame was corrupted.
Finally, I saw on the <a href="https://wiki.gnuradio.org/index.php?title=Clock_Recovery_MM">“Clock Recovery MM” GNU Radio wiki page</a> that the block was actually deprecated in favor of a new “Symbol Sync” block.
I swapped that in and tried a few different algorithms, eventually settling on zero crossing, which produced a great-looking constellation and got the decode tools to start emitting uncorrupted frames.</p>

<p><img src="/images/azure-orbital/good_constellation.png" alt="Good constellation plot" /></p>

<p>The final flowchart is available <a href="/files/AQUA_Demodulator.grc">here</a>.</p>

<h2 id="decoding">Decoding</h2>
<p>Per the original AWS blog post, NASA’s <a href="https://directreadout.sci.gsfc.nasa.gov/?id=dspContent&amp;cid=69">RT-STPS</a> toolkit is the “official” way to decode data from AQUA (and other) satellites. Unfortunately, despite it saying it got lock on the demodulated data, every frame it processed was “unroutable.”
I dug into the source and eventually set a watchpoint where the satellite ID is extracted from the frame headers (the satellite ID being how it decides to route data), and the ID was all wrong.
I’m still unsure why this was, but I didn’t want to spend much more time on it as the tooling in the aforementioned X-Band-Decoders repo already had <a href="https://github.com/altillimity/X-Band-Decoders/tree/master/Aqua%20Decoder">decoding</a> and <a href="https://github.com/altillimity/X-Band-Decoders/tree/master/Aqua%20MODIS%20Extractor">data separation</a> utilities for the <a href="https://lpdaac.usgs.gov/data/get-started-data/collection-overview/missions/modis-overview/">MODIS</a> data (which was all I needed to produce the simple true color image I was going for).</p>

<p>After waiting on a few dependencies to build, these tools worked on the first try, yielding a stream of uncorrupted MODIS image data frames. Nice!</p>

<h1 id="rendering-an-image">Rendering an Image</h1>

<p>The X-Band-Decoders repo was once again helpful, pointing me to <a href="https://github.com/rocketscientist-fred/weathersat">weathersat</a>.
As the README in the repo says,</p>

<blockquote>
  <p>If you don’t read this README with attention, as well as the</p>

  <p>./hrpt.exe –help</p>

  <p>output, you will (!!) fail to successfully run the s/w. Especially the environment variables described below are crucial !!!!</p>
</blockquote>

<p>Promptly ignoring this, I spent an hour or so trying to get it to work, to no avail.</p>

<p>Going back and looking at the output of <code class="highlighter-rouge">--help</code> though, there’s a nice example of how to use the utility to render a real color image from MODIS data - exactly what I wanted!
After stumbling over a couple last things (spaces in directory names breaking stuff and a missing trailing <code class="highlighter-rouge">/</code> in the necessary envvars), I had an image:</p>

<p><img src="/images/azure-orbital/AQ_MODIS-2022-03-19-2100-bandxx_ch11_correct.jpg" alt="2022-03-19 21:00 MODIS" /></p>

<p>Since this process had taken a few days to perfect, I also had another capture ready to process by the time it worked. Running it through resulted in another pretty decent image:</p>

<p><img src="/images/azure-orbital/AQ_MODIS-2022-03-21-2056-bandxx_ch11_correct.jpg" alt="2022-03-21 20:56 MODIS" /></p>

<p>(For reference on what you’re seeing geographically speaking, the peninsula visible at the bottom of the images is Baja California)</p>

<p>Success!</p>

<h1 id="other-captures">Other Captures</h1>

<p>I received data for a total of five AQUA passes (weighing in at ~100GB total!), however only two of them had usable data.
I’m sure there’s more tweaking that could be done in the demod/decode steps which would probably yield more usable frames, but even the images I produced above have significant bands of little to no reception.
As the images show, it was somewhat cloudy over the datacenter the antenna was located at (and these captures were all done within a few days of each other), so perhaps the weather was interfering?
Given that these are relatively high frequency signals (8.16GHz), I <em>think</em> atmospheric conditions could have an effect…</p>

<p>Either way, I got a couple cool images so I was very much content :)</p>

<h1 id="conclusion">Conclusion</h1>

<p>This entire experiment occurred over the span of about two months from first requesting access to getting images out, but out of that span I only spent about three days actively working on it. Surprisingly quick for such a project I think!</p>

<p>All in all, it was quite a fun thing to spend some time on. I learned quite a bit more about software defined radio (hopefully you did as well!) and more than I would have ever liked to about VITA-49.</p>

<p>As always, feel free to reach out with any questions or feedback. I also still have the raw capture data if anyone would like a copy of it to experiment with the demod/decode/render steps themselves.</p>

<h1 id="appendix-orbital-qa">Appendix: Orbital Q&amp;A</h1>

<blockquote>
  <p>Q: What is the Gain/Temperature field? As far as I know this is normally a characteristic of the receiving system, not a tunable parameter?</p>

  <p>A: The G/T field is a requirement passed by the user to the system. So you aren’t setting the G/T but rather requesting a min G/T spec. This is because Orbital integrates across many first party and partner sites with various antenna sizes. So if you needed a certain bar of performance when you query availability from site to site you can have that guarantee by specifying whatever G/T your link needs. We are not filtering on this yet in the near-term so feel free to put a placeholder value here.</p>
</blockquote>

<blockquote>
  <p>Q: What is the format for the demodulation and decoding configuration?</p>

  <p>A: The argument is an unvalidated blob or string type that is a copy/paste of the modem config file. Right now we offer Kratos modems in this mode.</p>
</blockquote>

<blockquote>
  <p>Q: How is the data received actually encoded?</p>

  <p>A: Azure Orbital leverages DIFI for its RF transport layer. Those details can be downloaded here at https://dificonsortium.org/, and Microsoft had a significant hand in the creation of this consortium. To that effect, our SDR team has released the GNU Radio Azure Software Radio toolbox publicly available on GitHub. This lets you interface directly with Orbital in GNU Radio without any need for manual coding or modding! All you have to do is specify your VM with this toolbox loaded as the endpoint. Check it out here: https://github.com/microsoft/azure-software-radio</p>
</blockquote>

<p>N.B. The Azure Software Radio toolbox only supports reading data from a socket (doing all of the processing as it’s streaming in), or from a Azure blob storage file.</p>]]></content><author><name>Nick Gregory</name></author><category term="meteorology" /><summary type="html"><![CDATA[If you follow AWS closely, you may have heard about a niche product launch a few years back called Ground Station which lets you rent, well, a ground station (basically a big antenna plus supporting equipment to communicate with satellites). A friend recently linked me an AWS blog post with a sample use case which described using it as a way of receiving real time imagery from orbiting weather satellites. Now funny enough, receiving data from polar orbiting weather satellites has been a side project of mine for over a decade now, but living in NYC has put a bit of a hold on it. I used to have a home-built QFH antenna which I used to receive images with a surprisingly high success rate given the janky construction of it.]]></summary></entry><entry><title type="html">The Discovery and Exploitation of CVE-2022-25636</title><link href="https://www.nickgregory.me/post/2022/03/12/cve-2022-25636/" rel="alternate" type="text/html" title="The Discovery and Exploitation of CVE-2022-25636" /><published>2022-03-12T00:00:00+00:00</published><updated>2022-03-12T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2022/03/12/cve-2022-25636</id><content type="html" xml:base="https://www.nickgregory.me/post/2022/03/12/cve-2022-25636/"><![CDATA[<p>A few weeks ago, I found and reported CVE-2022-25636 - a heap out of bounds write in the Linux kernel.
The bug is exploitable to achieve kernel code execution (via ROP), giving full local privilege escalation, container escape, whatever you want.</p>

<p>In this post, I cover the entire process of finding and exploiting the bug (to as much of an extent as I did at least) - from initial “huh that looks weird” to a working LPE.</p>

<p>It’s a long post, but hopefully this will be useful to others (especially those newer to kernel exploitation) to get a feel for what my process was like.</p>

<p>Finally, if you’re just here for the exploit details and don’t want the backstory of me discovering it, feel free to <a href="#exploitation">skip ahead</a>.</p>

<h1 id="bug-hunting">Bug Hunting</h1>

<p>One night a few weeks back, I was bored. There were a few other projects I could have worked on,
but none of them seemed particularly interesting, so I decided to do some random (kernel) code review.
There have been a few notable bugs in the netfilter kernel subsystem that I’ve seen over the past few years
(perhaps most notably <a href="https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html">CVE-2021-22555</a>), so I decided to start looking there.
It’s a relatively complex subsystem that’s widely available - the perfect target.</p>

<h2 id="aside-what-is-netfilter">Aside: What is netfilter?</h2>

<p>Netfilter, as the <a href="https://www.netfilter.org/">project’s website</a> says, “enables packet filtering, network address [and port] translation (NA[P]T), packet logging, userspace packet queueing and other packet mangling.”
You’ve probably interacted with netfilter before without knowing about it!
Ever used <code class="highlighter-rouge">iptables</code> to block inbound traffic on a server, or configured a Linux box as a router with NAT?
All of that packet processing is done in the kernel by netfilter.</p>

<p>I’ve done a bunch of stuff with <code class="highlighter-rouge">iptables</code> in the past, but other than that I wasn’t familiar with anything else netfilter provided (and definitely didn’t know anything about how it worked),
so I clicked around on some files in the subsystem’s <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter">main source directory</a> to try and get a lay of the land.</p>

<p>I started at the top by looking at a few of the (what seemed to be) protocol parsers. Parsing non-trivial data is always potentially error-prone, so it felt like a good place to start.
I ended up focusing on the parts of the code taking configuration input from userland (over a netlink socket), as while a bug in packet processing would be interesting,
the decoder would still have to be “activated” by some configuration from userland in the first place.</p>

<p><em>Editor’s note</em>: perhaps it’s worth taking another look at these since <a href="https://syzkaller.appspot.com/upstream">syzkaller</a> doesn’t show much of any coverage on these files so maybe there’s something lurking…</p>

<p>Anyways, after going through <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nf_conntrack_ftp.c"><code class="highlighter-rouge">nf_conntrack_ftp.c</code></a> and a few others without seeing much of interest,
I was scrolling through looking for other “types” of code and saw <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nf_dup_netdev.c"><code class="highlighter-rouge">nf_dup_netdev.c</code></a>.
I was actually just about to click on some other file when I saw that and thought “well maybe there could be some refcounting bug if something is duplicated” so I decided to look in there.</p>

<p>It’s quite a short file, but <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nf_dup_netdev.c#L67">line 67</a></p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>entry = &amp;flow-&gt;rule-&gt;action.entries[ctx-&gt;num_actions++];
</code></pre></div></div>

<p>stood out to me for two particular reasons:</p>

<ol>
  <li>It was incrementing <code class="highlighter-rouge">ctx-&gt;num_actions</code> and using it as the index into an array without any bounds checking</li>
  <li>The index (<code class="highlighter-rouge">ctx-&gt;num_actions</code>) and the array itself (<code class="highlighter-rouge">flow-&gt;rule-&gt;action.entries</code>) are struct members of two completely different variables, not obviously related. That is, the line is equivalent to <code class="highlighter-rouge">a-&gt;b[x-&gt;y]</code> which seems potentially more “suspicious” than <code class="highlighter-rouge">a-&gt;b[a-&gt;c]</code>.</li>
</ol>

<p>Neither of these reasons made this a definite bug (yet) of course, but the line definitely “smelled,” which prompted a bit more digging.</p>

<h1 id="is-it-a-bug">Is It a Bug?</h1>

<p>I had a few immediate questions:</p>

<ul>
  <li>What determines the size of the <code class="highlighter-rouge">action.entries</code> array?</li>
  <li>How is <code class="highlighter-rouge">nft_fwd_dup_netdev_offload</code> called? And what controls how many times it’s called?</li>
  <li>When/how is <code class="highlighter-rouge">ctx</code> initialized?</li>
</ul>

<p>At this point I also realized that this was in <code class="highlighter-rouge">nft_fwd_dup_netdev_</code><strong><code class="highlighter-rouge">offload</code></strong>. Even if this bug was real, it may only be reachable on systems with Network Interface Cards (NICs) with support for packet processing offload, which are very rare (and very expensive).
It would still be a bug, but maybe not the most interesting bug in the world.</p>

<p>Pulling up the x-refs of <code class="highlighter-rouge">nft_fwd_dup_netdev_offload</code> showed it was called in a <code class="highlighter-rouge">.offload</code> handler of the <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nft_dup_netdev.c#L67"><code class="highlighter-rouge">dup</code></a> and <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nft_fwd_netdev.c#L77"><code class="highlighter-rouge">fwd</code></a> <code class="highlighter-rouge">nft_expr_type</code>s.
Looking at the references for the <code class="highlighter-rouge">offload</code> struct member (which is really not a pleasant experience in Elixir…), I found <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nf_tables_offload.c#L125">this use</a> which answered all but one of the questions above:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ctx = kzalloc(sizeof(struct nft_offload_ctx), GFP_KERNEL);

...

while (nft_expr_more(rule, expr)) {
  if (!expr-&gt;ops-&gt;offload) {
    err = -EOPNOTSUPP;
    goto err_out;
  }
  err = expr-&gt;ops-&gt;offload(ctx, flow, expr);
  if (err &lt; 0)
    goto err_out;

  expr = nft_expr_next(expr);
}
</code></pre></div></div>

<ul>
  <li><strong>How is <code class="highlighter-rouge">nft_fwd_dup_netdev_offload</code> called?</strong>: It’s indirectly called as part of <code class="highlighter-rouge">nft_flow_rule_create</code>.</li>
  <li><strong>What controls how many times <code class="highlighter-rouge">nft_fwd_dup_netdev_offload</code> is called?</strong>: Offload handlers (and therefore <code class="highlighter-rouge">nft_fwd_dup_netdev_offload</code> for fwd/dup expressions) are called for every expression in the rule which has one. No other checks.</li>
  <li><strong>When/how is <code class="highlighter-rouge">ctx</code> initialized?</strong>: For each rule created, the context is zero-initialized and the same instance is passed to each offload handler.</li>
</ul>

<p>More importantly than all of those however, the answer to the most interesting question was just above:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>expr = nft_expr_first(rule);
while (nft_expr_more(rule, expr)) {
  if (expr-&gt;ops-&gt;offload_flags &amp; NFT_OFFLOAD_F_ACTION)
    num_actions++;

  expr = nft_expr_next(expr);
}

...

flow = nft_flow_rule_alloc(num_actions);
</code></pre></div></div>

<p>We see that for each expression in the rule, a <code class="highlighter-rouge">num_actions</code> counter is incremented <em>only when the expression has a certain bit (<code class="highlighter-rouge">NFT_OFFLOAD_F_ACTION</code>) set</em> in <code class="highlighter-rouge">ops-&gt;offload_flags</code>.
Quickly checking back at the definition for the <code class="highlighter-rouge">dup</code> and <code class="highlighter-rouge">fwd</code> expressions, neither of them have <code class="highlighter-rouge">NFT_OFFLOAD_F_ACTION</code> set.
In fact, there’s only one use of <code class="highlighter-rouge">NFT_OFFLOAD_F_ACTION</code> at all: in the <code class="highlighter-rouge">immediate</code> expression type (<a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nft_immediate.c#L227">here</a>).</p>

<p>At this point I was pretty confident there was a bug.
As far as I could tell, the only thing that could prevent it would be if there was some enforcement of having one immediate per dup/fwd rule.</p>

<h2 id="checking-for-exploitability">Checking for Exploitability</h2>

<p>Unfamiliar with how to “talk” to nftables, I searched around for some examples of what an nftables table/chain definition looks like and how to install one.
<a href="https://www.spinics.net/lists/netfilter/msg59251.html">One mailing list post</a> was particularly useful as it had everything needed, including how to set the <code class="highlighter-rouge">offload</code> flag which is required to reach the bug (because of <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nf_tables_api.c#L3423">this</a> check).</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>table netdev filter_test {
  chain ingress {
    type filter hook ingress device eth0 priority 0; flags offload;
    ip daddr 192.168.0.10 tcp dport 22 drop
  }
}
</code></pre></div></div>

<p>With that sample in hand, I started playing around with nftables to see if/how the bug could be hit.</p>

<p>First, I set up a kprobe on <code class="highlighter-rouge">flow_rule_alloc</code> (responsible for creating our <code class="highlighter-rouge">action.entries</code> array) with a fetcharg to show the <code class="highlighter-rouge">num_actions</code> argument: <code class="highlighter-rouge">sudo kprobe-perf -F 'p:flow_rule_alloc num_actions=%di:u32'</code>.
This immediately failed because (at least on Ubuntu) nftables is a lazily loaded kernel module, so the code wasn’t actually loaded yet. Oops.
After quickly running <code class="highlighter-rouge">nft -f mailing_list.nft</code> (which forced the kernel module to load even though the command itself failed), I could actually set the kprobe.</p>

<p>Running <code class="highlighter-rouge">nft -f mailing_list.nft</code> for real this time resulted in a kprobe hit (despite the rule installation failing):</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo nft -f mailing_list.nft
a.nf:1:1-2: Error: Could not process rule: Operation not supported
table netdev x {
^^
</code></pre></div></div>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo kprobe-perf 'p:flow_rule_alloc num_actions=%di:u32'
Tracing kprobe flow_rule_alloc. Ctrl-C to end.
             nft-20137   [001] .... 1573655.306178: flow_rule_alloc: (flow_rule_alloc+0x0/0x60) num_actions=1
</code></pre></div></div>

<p>So <code class="highlighter-rouge">flow_rule_alloc</code> was indeed being hit even though the VM I was testing in definitely didn’t have a network device capable of hardware offload!
The system didn’t crash or anything so it seemed like the buggy behavior wasn’t getting hit yet, but this was good progress.</p>

<p>And it was at this point that I realized I had never changed the example from the mailing list to actually include a <code class="highlighter-rouge">dup</code> expression. Oops again.
After changing the rule to <code class="highlighter-rouge">ip daddr 192.168.0.10 dup to eth0</code> though, my system annoyingly remained in a non-<code class="highlighter-rouge">panic</code>d state.</p>

<p>Before continuing, I also wanted to try running the <code class="highlighter-rouge">nft</code> commands after <code class="highlighter-rouge">unshare</code>ing into a new user and network namespace (<code class="highlighter-rouge">unshare -Urn</code>) to see if it was possible to reach this as an unprivileged user. Sure enough it was, making this bug potentially even more potent.</p>

<p>Back to the bug itself though: poking around through the <code class="highlighter-rouge">nft</code> man pages, I found you could pass <code class="highlighter-rouge">-d netlink</code> which ended up being incredibly useful as it showed the “disassembly” of the rule that was being sent to the kernel:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ meta load protocol =&gt; reg 1 ]
[ cmp eq reg 1 0x00000008 ]
[ payload load 4b @ network header + 16 =&gt; reg 1 ]
[ cmp eq reg 1 0x0a00a8c0 ]
[ immediate reg 1 0x00000001 ]
[ dup sreg_dev 1 ]
</code></pre></div></div>

<p>From this, it’s apparent why the bug wasn’t being triggered: the CLI generates an immediate expression before the <code class="highlighter-rouge">dup</code> (representing the device the packet should be duplicated to), so the accounting was “working”.
Is it possible to have a <code class="highlighter-rouge">dup</code> without a preceding <code class="highlighter-rouge">immediate</code>?
I couldn’t find a way to have the CLI install a rule from this disassembled format (so couldn’t force it to generate <code class="highlighter-rouge">dup</code>s with no <code class="highlighter-rouge">immediate</code>s),
so it was time to go deeper and manually create the packets to send to the subsystem.</p>

<h3 id="golang-implementation">Golang Implementation</h3>

<p>I have a love/hate relationship with Go, but that’s a blog for another time.
At the end of the day, it’s basically the only language that has a large community (and therefore a large selection of libraries) that’s low enough level to do what I need for this,
but also high enough level to not make me want to throw my computer out the window while I’m trying to get something to work.
So I started building a proof of concept in Go.</p>

<p>Conveniently, Google has a go <code class="highlighter-rouge">nftables</code> <a href="https://github.com/google/nftables">library</a> which looked like a good starting point since I’d be able to manually construct the rule.
Unfortunately, it didn’t expose quite everything I needed (mainly around setting the offload flag) and by the time I discovered this, I was a few hours into building around it and really didn’t want to rewrite it in C.
I cobbled together some truly awful code which used reflection to overwrite the private array of messages to send, manually constructed the necessary chain creation message with the proper bit flipped, etc. etc., and another hour or so later I was back to where I started with the <code class="highlighter-rouge">nft</code> CLI.</p>

<p>I added another <code class="highlighter-rouge">dup</code> without an <code class="highlighter-rouge">immediate</code> before it, ran it and…</p>

<p>…</p>

<p>not much happened. It errored out with the normal “operation not permitted”, but nothing else. So at least it didn’t get rejected because of missing immediates, which was good, I guess?</p>

<p>Then, a few seconds later, kaboom. The kernel panicked and my shell was dead. We have a bug!</p>

<p>Now comes the fun part.</p>

<h1 id="exploitation">Exploitation</h1>

<p>Analyzing what our bug actually provides us (with the help of <code class="highlighter-rouge">pahole</code> to get struct offsets), we see that there are 2 out of bounds writes:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>entry = &amp;flow-&gt;rule-&gt;action.entries[ctx-&gt;num_actions++];
entry-&gt;id = id;
entry-&gt;dev = dev;
</code></pre></div></div>

<ol>
  <li>The write of <a href="https://elixir.bootlin.com/linux/v5.16.11/source/include/net/flow_offload.h#L199"><code class="highlighter-rouge">enum flow_action_id id</code></a> immediately after the end of the array, writing the value 4 or 5 (depending on whether this is a <code class="highlighter-rouge">fwd</code> or <code class="highlighter-rouge">dup</code> expression)</li>
  <li>The write of <a href="https://elixir.bootlin.com/linux/v5.16.11/source/include/net/flow_offload.h#L205"><code class="highlighter-rouge">struct net_device *dev</code></a> 24 bytes past the end of the array</li>
</ol>

<p>As for the sizes of everything (on my Ubuntu test VM with a 5.13 kernel), the base <code class="highlighter-rouge">flow_rule</code> structure is 32 bytes and each additional <code class="highlighter-rouge">entry</code> in the array is 80 bytes. This means:</p>

<ul>
  <li>If there are no immediates in our rule, the rule allocation will be 32 bytes, landing in the kmalloc-32 slab</li>
  <li>One immediate gives an allocation of size 112, landing in the kmalloc-128 slab</li>
  <li>Two immediates give an allocation of size 192, landing in the kmalloc-192 slab</li>
  <li>and so on</li>
</ul>

<p>Focusing on the <code class="highlighter-rouge">dev</code> pointer write, the above allocation sizes mean that the write will land either at offset 24 of the next 32- or 192-slab allocation, or at offset 8 of the next 128-slab allocation.
I manually hunted around through <code class="highlighter-rouge">pahole</code>’s output looking for any interesting structure which had a pointer at the necessary offset, but came up empty handed.
Everything that I found was either in a subsystem that required elevated privileges to access, in a subsystem that is “exotic” (probably not easily reachable), or in a subsystem which I felt was too flaky to try and land in (e.g. the scheduler).</p>

<p>Long story short, I put this aside and came back to it a couple days later with fresh eyes.</p>

<p>While reading through Alexander Popov’s writeup of <a href="https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html">another recent kernel bug</a> looking for inspiration the thought occurred to me:
we have the ability to cause <strong>multiple</strong> of these out of bounds writes, not just one (since multiple <code class="highlighter-rouge">dup</code>s can be put in a rule).
So in addition to hitting offset 8 of the next 128-slab allocation, we could also hit offset 88 of that allocation,
or offset 40 of the 2nd next allocation, or offset 120 of the 2nd next, or…</p>

<p>Having just read that writeup in which Alexander uses the security pointer (<strong>at offset 40</strong>) to land a <code class="highlighter-rouge">kfree</code>, the exploit path became obvious.</p>

<p>What we do is:</p>
<ul>
  <li>Spray a bunch of System V message queue messages, causing the kernel to allocate a lot of <code class="highlighter-rouge">msg_msg</code> structures of a controlled size. For now, we care about landing in the kmalloc-128 slab</li>
  <li>Free some of them</li>
  <li>Add the netlink rule, causing the <code class="highlighter-rouge">flow_rule</code> allocation to hopefully land in one of the just-free’d heap slots</li>
  <li>Do our OOB write a total of 3 times (i.e. have 3 <code class="highlighter-rouge">dup</code>s in our rule with no <code class="highlighter-rouge">immediate</code>), clobbering
    <ul>
      <li>The <code class="highlighter-rouge">list_head.prev</code> pointer (offset 8) of the next message on the heap</li>
      <li>Some random data (offset 88) in the contents of the next message on the heap</li>
      <li>The <code class="highlighter-rouge">security</code> pointer (offset 40) of the 2nd next message on the heap</li>
    </ul>
  </li>
  <li>Find and <code class="highlighter-rouge">msgrcv</code> the 2nd next message, causing the kernel to <code class="highlighter-rouge">kfree()</code> the <code class="highlighter-rouge">net_device</code> (since it was a <code class="highlighter-rouge">net_device</code> pointer that was written)</li>
  <li>Allocate some more messages, but this time in the kmalloc-4k slab with the goal of landing in the <code class="highlighter-rouge">net_device</code> that was just free’d</li>
  <li>Cause the kernel to do something on the device which would cause a function pointer in the (now controlled) <code class="highlighter-rouge">net_device.netdev_ops</code> operations struct to be called, giving us code execution. Reading from <code class="highlighter-rouge">/proc/net/dev</code> is a simple answer to this (causing <code class="highlighter-rouge">netdev_ops-&gt;ndo_get_stats64</code> to <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/core/dev.c#L10697">be called</a>) which is what I ended up using.</li>
</ul>

<p>This chain is <em>incredibly</em> nice. Just to highlight a few benefits:</p>

<ul>
  <li>We know exactly which <code class="highlighter-rouge">msg_msg</code> had its <code class="highlighter-rouge">list_head.prev</code> pointer clobbered (and is therefore unsafe to free) since we can <code class="highlighter-rouge">MSG_COPY</code> it out of the queue (which won’t touch the next/prev pointers since it’s not actually removed) and look to see if the contents of the message have changed.</li>
  <li>In addition to telling us which message is “dangerous”, this also leaks the kernel heap pointer that we’re going to be landing in, making it trivial to start ROPing (more on this <a href="#sidenote-rop">below</a>).</li>
  <li>We also know exactly which message had its <code class="highlighter-rouge">security</code> pointer overwritten. We could either add a 4th <code class="highlighter-rouge">dup</code> (and again look at message data after copying it), or we can look at the message’s <code class="highlighter-rouge">mtype</code> after it’s copied out. Remember how 2 things are written out of bounds (4 or 5, and the pointer)? It just so happens that the 4 or 5 gets written over the message’s <code class="highlighter-rouge">mtype</code> (offset 16), so by checking if the <code class="highlighter-rouge">mtype</code> changed from whatever value was put in, we can tell if we have the right message.</li>
</ul>

<p>By the end of the night (perhaps staying up a <em>bit</em> too late…), I had the first working proof of concept for this (in an ARM VM not x86, hence the different registers and whatnot).</p>

<p><img src="/images/cve-2022-25636/Screen%20Shot%202022-02-15%20at%2002.28.17.png" alt="A panic!" /></p>

<p>Success!</p>

<p>A few more hours of hacking on this though, and I hadn’t gotten much closer to code execution.</p>

<p>For some reason, the exploit was incredibly flaky (i.e. it had a very low success rate). I figured this was due to one (or more) of the following:</p>

<ol>
  <li>The kernel <a href="https://mxatone.medium.com/randomizing-the-linux-kernel-heap-freelists-b899bb99c767">freelist randomization</a> was more effective against this technique than I expected</li>
  <li>All of the work the Go runtime does in the background was messing with the kernel heap</li>
  <li>Other things running on the system were causing sporadic <code class="highlighter-rouge">kmalloc-128</code> allocations, throwing off/using up the freelist</li>
</ol>

<p>I tried changing everything to work out of the <code class="highlighter-rouge">kmalloc-2048</code> slab (since all of the offset math still works out), but this didn’t seem to help at all.
At this point I probably should have spent some time with a kernel debugger, tracing exactly what was happening with the freelist, but I decided to go ahead and rewrite the exploit in C to see if that would help.
If nothing else, it’d probably make later stages of the exploit much easier to work with, since I wouldn’t have to try to link in some other thing that the kernel could jump to as a final stage of the exploit.</p>

<h2 id="rewriting">Rewriting</h2>

<p>Boy was this a nightmare. There <em>is</em> a C library for “nicely” working with nftables, however at the end of the day it’s C so nothing is really “nice.”
After many hours of staring at <code class="highlighter-rouge">strace</code> output of the netlink packets, trying to figure out what I was missing in the C code, I eventually got back to where I was with the Go version.
If you’re interested, the code necessary to interface with nftables is available in the <a href="https://www.openwall.com/lists/oss-security/2022/02/21/2">reproducer</a> I posted to the oss-security mailing list.</p>

<p>But it wasn’t any more stable. Damn.</p>

<p>After another couple of days of messing around (mainly trying to figure out if there was a specific order in which to free the initial messages to best get around freelist randomization), I got to a point where the exploit was ~30% reliable, which was good enough to proceed with.
It’s entirely possible I was missing something obviously broken in my exploit code, but if you have any ideas about something I could be missing kernel-side, please do drop me an email or DM - I would really like to know what’s going on.</p>

<p>Having spent enough time on this already, I decided to forgo making this into a full exploit. I just wanted to get my root shell and call it a day.
I disabled SMEP, SMAP, KPTI, and KASLR on my test VM, and put together a quick “callback” (getting me root and out of any container/namespace) which I could jump directly to from the kernel:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Read the current task_struct pointer out of the per-CPU area
// (the gs-relative offset is specific to this kernel build)
void *get_task(void) {
    void *task;
    asm volatile ("movq %%gs:0x1fbc0, %0":"=r"(task));
    return task;
}

// prepare_kernel_cred, commit_creds, switch_task_namespaces, and
// init_nsproxy are hardcoded kernel addresses (KASLR is disabled)
void *elevate(void *dev, void *storage) {
    void *c = ((void * (*)(int))(prepare_kernel_cred))(0);  // root creds
    ((void (*)(void *))(commit_creds))(c);                  // apply to this task
    void *current = get_task();
    // swap back to the init namespaces, escaping any container
    ((void (*)(void *, void *))(switch_task_namespaces))(current, (void *)init_nsproxy);
    return NULL;
}
</code></pre></div></div>

<p>And that’s basically it. Minus the whole “it only works 30% of the time,” the exploit was done, and I got my shell after a few attempts.</p>

<p><img src="/images/cve-2022-25636/Screenshot%20from%202022-03-08%2023-28-57.png" alt="root" /></p>

<p>And before you go burning cycles trying to crack that password hash, it’s just <code class="highlighter-rouge">vagrant</code> :P</p>

<h2 id="sidenote-rop">Sidenote: ROP</h2>

<p>While I didn’t end up implementing it in my exploit, we’re in an amazing position to ROP (making SMEP/SMAP/KPTI a non-issue).
Since the kernel heap address of the <code class="highlighter-rouge">net_device</code> is leaked, we know where our message data is going to be in memory.
That pointer can then be used to compute an address for our fake <code class="highlighter-rouge">netdev_ops</code> (putting it somewhere else in our message),
and then when the kernel goes to call a function taken from that ops structure (with the <code class="highlighter-rouge">net_device</code> (/our message) as the first argument),
we can give it the address of a simple <code class="highlighter-rouge">mov rsp, rdi; ret</code> gadget to stack pivot on to our message.
From there, anything is possible.</p>

<p>The only thing missing is a KASLR leak, but that’s not much of a barrier :)</p>

<h1 id="code">Code?</h1>

<p>In the couple of weeks it took me to write up this blog post, <a href="https://twitter.com/Bonfee1/status/1500837241991618565">@Bonfee</a> already independently developed an exploit for the bug and published it!</p>

<p>I haven’t looked through the entirety of their implementation, but it seems to use a similar path to what I describe above. However, it also includes a full ROP chain and KASLR leak making it far more complete than mine. I’d recommend you check it out! <a href="https://github.com/Bonfee/CVE-2022-25636">https://github.com/Bonfee/CVE-2022-25636</a></p>

<h1 id="wrapping-up">Wrapping Up</h1>

<p>This was a really fun bug to discover and work on. From start to end, it took just under a week to find and triage the bug, figure out how to hit it, and build the exploit.
While not novel, the OOB write primitive we get with it is also pretty interesting, and makes for quite a clean exploit as we’ve seen.</p>

<p>I hope you’ve enjoyed reading, and of course reach out with any questions you may have.</p>]]></content><author><name>Nick Gregory</name></author><category term="linux" /><category term="security" /><summary type="html"><![CDATA[A few weeks ago, I found and reported CVE-2022-25636 - a heap out of bounds write in the Linux kernel. The bug is exploitable to achieve kernel code execution (via ROP), giving full local privilege escalation, container escape, whatever you want.]]></summary></entry><entry><title type="html">A snapshotting kernel module for fuzzing</title><link href="https://www.nickgregory.me/post/2021/12/10/afl-kmod/" rel="alternate" type="text/html" title="A snapshotting kernel module for fuzzing" /><published>2021-12-10T00:00:00+00:00</published><updated>2021-12-10T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2021/12/10/afl-kmod</id><content type="html" xml:base="https://www.nickgregory.me/post/2021/12/10/afl-kmod/"><![CDATA[<p>Right as the pandemic was starting in March/April 2020, I spent a couple of weekends writing a Loadable Kernel Module (LKM) for Linux,
designed to add a syscall which could be used by a fuzzer to quickly restore program state instead of using a conventional fork/exec loop.
This was <a href="https://github.com/AFLplusplus/AFLplusplus/issues/248">originally suggested</a> on the AFL++ <a href="https://github.com/AFLplusplus/AFLplusplus/blob/stable/docs/ideas.md">Ideas page</a>, and it nicely intersected a bunch of stuff I’m familiar with so I wanted to take a crack at it.</p>

<p>My implementation can be found in the now archived GitHub repo: <a href="https://github.com/kallsyms/snapshot-lkm">https://github.com/kallsyms/snapshot-lkm</a>.
It’s deprecated in favor of <a href="https://github.com/AFLplusplus/AFL-Snapshot-LKM">the AFL++ version</a>; however, as of Dec. 2020 that has also been frozen, as it’s a significant amount of work to update the module for each kernel version - it requires hooking some internal kernel functions which change frequently.</p>

<h1 id="overview">Overview</h1>
<p>My initial work was heavily based on the original kernel patchset from the SSLab at Georgia Tech, which can be found <a href="https://github.com/sslab-gatech/perf-fuzz">here</a>.
I’d <strong>strongly</strong> recommend reading <a href="https://gts3.org/assets/papers/2017/xu:os-fuzz.pdf">the original paper</a> to understand more about how the innards of the snapshotting work.
To summarize though, the basic idea is to add a new syscall (<code class="highlighter-rouge">snapshot()</code>) which can either snapshot or restore the “important” bits of the current process so a new fuzz case can be run.
This avoids the excessive overhead of a normal <code class="highlighter-rouge">fork()</code>, giving a very nice speedup versus conventional fuzzers.</p>

<h1 id="development-process">Development Process</h1>

<h2 id="understanding-the-original-implementation">Understanding the Original Implementation</h2>

<p>The first thing I needed to do was actually extract a diff/patch of what the paper implemented.
The main repo is (unfortunately) a full fork of Linux, but squashed, so we can’t easily <code class="highlighter-rouge">git diff</code> to see what was implemented.
A quick non-git <code class="highlighter-rouge">diff</code> against a freshly-cloned Linux v4.8.10 repo quickly fixed that, giving us <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch">the patch</a>.</p>

<p>I was surprised at how small it was.</p>

<p>There were only a total of 4 files that had meaningful changes which would affect normal program flow.
The rest are either header files, syscall definitions, or the snapshot/restore implementation itself.</p>

<p>Breaking down each major function change:</p>

<ul>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L41"><code class="highlighter-rouge">file.c:dup_fd</code></a>: when a <code class="highlighter-rouge">files_struct</code> (basically the set of file descriptors opened by a task) is duplicated (e.g. in <code class="highlighter-rouge">fork()</code>), the newly created <code class="highlighter-rouge">files_struct</code> needs to have its snapshot metadata initialized.</li>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L281"><code class="highlighter-rouge">exit.c:do_group_exit</code></a>: when a task exits as part of the entire group going down, snapshot metadata needs to be cleaned up.</li>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L295"><code class="highlighter-rouge">exit.c:exit_group</code></a>: when a task calls <code class="highlighter-rouge">exit()</code>, the snapshot is implicitly restored.</li>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L315"><code class="highlighter-rouge">fork.c:dup_mm</code></a>: when a task’s memory mappings are duplicated, the new <code class="highlighter-rouge">mm_struct</code>’s snapshot metadata needs to be initialized.</li>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L970"><code class="highlighter-rouge">memory.c:do_wp_page</code></a>: when a page fault occurs (when writing to a copy-on-write page), the snapshotting code may have some work to do.</li>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L1035"><code class="highlighter-rouge">memory.c:do_anonymous_page</code></a>: when an anonymous (non-file-backed) page is accessed for the first time, the page that was mapped (added to a PTE) needs to be recorded, as the PTE may need to be restored.</li>
</ul>

<p>With this understanding of what’s needed to “inject” into the kernel, let’s talk a bit about how I went about doing that.</p>

<h2 id="hooking-kprobes">Hooking: Kprobes</h2>

<p>Linux has some crazy built-in tech that very few people know about. One of these is kernel probes, or kprobes.
Kprobes are a way for things (be it a superuser in userland using the tracefs interface, or another kernel module using the in-kernel API) to, well, probe the kernel.
You can set probe points on nearly any function in the kernel (even ones not EXPORTed for normal module use), and fetch values from the state at the time the probe is hit.
And if you’re using the kernel-land interface (i.e. from a module), you can even overwrite registers (including the instruction pointer!) when your callback fires.</p>

<p>Almost everything in the snapshot process could be written “out-of-band” of the normal kernel functions (meaning it’s just observing what the kernel is doing and tracking state outside of any normal kernel structures),
however in <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L1021">one place</a>, the modifications cause a function to return early.</p>

<p>There’s a neat trick you can do with kprobes to emulate this behavior: set the instruction pointer to a stub function which immediately returns.
Since that stub was never actually <code class="highlighter-rouge">call</code>ed (specifically, since no return instruction pointer was pushed to the stack), when that stub <code class="highlighter-rouge">ret</code>urns, it will pop off the return IP that the probed function should have returned to, effectively giving us a way to return early.
This will only work if the probe is on the very first instruction of a function (otherwise the stack may have been expanded by the probed function), but this will be the case for us so we’re set.
<a href="https://www.kernel.org/doc/Documentation/kprobes.txt#:~:text=If%20you%20change%20the%20instruction%20pointer">The docs</a> have a bit more detail about what you actually need to do to achieve this with the kprobe subsystem.</p>

<h2 id="hooking-syscall-table">Hooking: syscall table</h2>

<p>In addition to the purely-additive things we need to run when certain kernel functions are called, we also need to completely hijack the <code class="highlighter-rouge">exit</code> syscall and add a new syscall entirely to do our snapshotting.</p>

<p>Side note: as the AFL++ devs did in their version, the snapshot operation should probably have been implemented as an <code class="highlighter-rouge">ioctl</code> instead.
However, since I was treating this as a proof-of-concept and I already needed to do syscall table rewriting for <code class="highlighter-rouge">exit()</code> I figured I might as well do the same for <code class="highlighter-rouge">snapshot()</code>, and chose to overwrite the <code class="highlighter-rouge">tuxcall()</code> syscall since it’s completely unused.</p>

<p>Anyways, to get control over the syscalls, we need to overwrite the syscall table, which Linux uses to dispatch syscalls to their respective handlers.
If the kernel is “nice” and has the <code class="highlighter-rouge">sys_call_table</code> as a named symbol, we can use that.
In the case it doesn’t though, the quickest way I found to do this is to find where in kernel memory the address of the <code class="highlighter-rouge">read()</code> syscall handler is immediately followed by the address of the <code class="highlighter-rouge">write()</code> syscall handler, since those are the first two syscalls. This is implemented in <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/module.c#L84"><code class="highlighter-rouge">get_syscall_table</code></a>.</p>

<p>The only other thing we need to do to hook the syscall table is make that memory writable before overwriting it. To do that, I decided to temporarily disable the write-protect bit (bit 16) in CR0 instead of messing around with properly remapping the memory as R/W. Again, proof-of-concept code :)</p>

<h1 id="implementation">Implementation</h1>

<p>Now, with all of that out of the way, let’s do a quick overview of the module implementation.</p>

<p>Starting at the (logical) top, in <code class="highlighter-rouge">mod_init</code> we grab the address of the syscall table, flip the WP bit in cr0, save the existing handlers, and overwrite the handler pointers with our own.</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void **syscall_table = get_syscall_table();
...
_write_cr0(read_cr0() &amp; (~(1 &lt;&lt; 16)));
orig_sct_snapshot_entry = syscall_table[__NR_snapshot];
orig_sct_exit_group = syscall_table[__NR_exit_group];
syscall_table[__NR_snapshot] = &amp;sys_snapshot;
syscall_table[__NR_exit_group] = &amp;sys_exit_group;
_write_cr0(read_cr0() | (1 &lt;&lt; 16));
</code></pre></div></div>

<p>Next, we hook the two functions we need (<code class="highlighter-rouge">do_wp_page</code> and <code class="highlighter-rouge">page_add_new_anon_rmap</code>) with their respective handlers.
This uses a <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/hook.c#L15">small wrapper I wrote</a> which keeps track of all registered hooks so that we can cleanly tear them all down when the module unloads.</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (!try_hook("do_wp_page", &amp;wp_page_hook))
...
if (!try_hook("page_add_new_anon_rmap", &amp;do_anonymous_hook))
...
</code></pre></div></div>

<p>Lastly, we call into the main snapshotting code so it can do some initialization (just grabbing some addresses out of kallsyms).</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>return snapshot_initialize_k_funcs();
</code></pre></div></div>

<p>At this point, we’re initialized, our hooks are installed, and we’re ready for a “snapshot syscall aware” program to run.</p>

<p>From this point down, there’s really very little that was changed from the original patchset.</p>

<p>The only exceptions are:</p>

<ul>
  <li>When that program calls our snapshot syscall, it hits <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/module.c#L35">the handler</a> which in turn dispatches either <code class="highlighter-rouge">make_snapshot</code> or <code class="highlighter-rouge">recover_snapshot</code>. Those functions are (IIRC) completely unmodified from the original patchset.</li>
  <li>The <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/snapshot.c#L777-L907">hooks</a> need to read out of the <code class="highlighter-rouge">pt_regs</code> passed in to grab the arguments that were actually passed to the hooked function (<a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/snapshot.c#L876-L877">example</a>).</li>
  <li>The one place which requires us to return early <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/snapshot.c#L863">overwrites the instruction pointer to a stub function</a> as described above.</li>
</ul>

<h1 id="wrapping-things-up">Wrapping Things Up</h1>

<p>When I originally wrote back to the AFL++ maintainers about this, my implementation did “work”, but only for a few seconds before the kernel would oops.
I suspected there was some locking that needed to happen that I wasn’t doing (because it’s <em>always</em> locking bugs), but I went ahead and passed this on to them, laying the groundwork for their <a href="https://github.com/AFLplusplus/AFL-Snapshot-LKM">(much improved) implementation</a>.
With that version working well, they were able to achieve a &gt;3x speedup in certain target programs, which (if this were a more maintainable strategy) would be a great improvement.
As they note in the README however, “due to syscall hooking and the never ending changes in the kernel we are unable to maintain it as we are busy working on libafl.”</p>

<p>Despite not being adopted, this was a very fun project to work on at the end of the day and a strategy that I feel like could be useful to other applications that need to make light modifications to the kernel.</p>]]></content><author><name>Nick Gregory</name></author><category term="security" /><summary type="html"><![CDATA[Right as the pandemic was starting in March/April 2020, I spent a couple of weekends writing a Loadable Kernel Module (LKM) for Linux, designed to add a syscall which could be used by a fuzzer to quickly restore program state instead of using a conventional fork/exec loop. This was originally suggested on the AFL++ Ideas page, and it nicely intersected a bunch of stuff I’m familiar with so I wanted to take a crack at it.]]></summary></entry><entry><title type="html">DIY Environmental Monitor</title><link href="https://www.nickgregory.me/post/2021/12/10/environmental-monitor/" rel="alternate" type="text/html" title="DIY Environmental Monitor" /><published>2021-12-10T00:00:00+00:00</published><updated>2021-12-10T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2021/12/10/environmental-monitor</id><content type="html" xml:base="https://www.nickgregory.me/post/2021/12/10/environmental-monitor/"><![CDATA[<p>Early on in the pandemic, there was a good amount of discussion on Twitter about indoor CO2 levels as more people were spending time exclusively at home, often in a single, small room for hours on end.
Since I was one of those people spending nearly the entire day in a single room, I decided to look around for a CO2 monitoring system.
While a simple “alert after levels rise above x ppm” is sufficient, I was really looking for one that would be able to log data to a remote system so that I could monitor it throughout the day and/or look back on historical data from any computer.
After being thoroughly disappointed with what was on Amazon (nothing at a reasonable price point seemed to be able to send data to a remote server), I decided it would be a nice little project to build my own.</p>

<h2 id="parts">Parts</h2>

<p>Looking around on Adafruit, I settled on the <a href="https://www.adafruit.com/product/4867">SCD-30</a> - a combined CO2, temperature, and humidity sensor.
As the Adafruit page said, this is a <a href="https://en.wikipedia.org/wiki/Nondispersive_infrared_sensor">NDIR sensor</a> so while it’s not the cheapest ($59), it’s actually measuring the CO2 in the air instead of approximating it from the concentration of volatile organic compounds (VOCs).</p>

<p>As for the “brains” of the device, I went with an <a href="https://www.adafruit.com/product/3269">ESP32-based development board</a>.
This gave me nice headers for all of the pins I’d need to get at (much like an Arduino or Raspberry Pi), but also includes WiFi out of the box so I could put it basically anywhere and not have to figure out how to get network to it.</p>

<p>I was also sure to grab the requisite <a href="https://www.adafruit.com/product/4209">cable</a> to connect the sensor to the dev board.</p>

<p>Finally, as a last-minute addition, I grabbed a cheap light sensor (the <a href="https://www.adafruit.com/product/4162">VEML7700</a>) since I figured that would also be fun to have logged. From the diagrams, it looked like it could be stacked directly on top of the SCD-30 with just some headers connecting the I2C and power pins, requiring no other changes.</p>

<h2 id="assembly">Assembly</h2>

<p>With everything in hand, the assembly was simple.
Just as planned, I was able to solder the 0.1” headers included with the light sensor to connect GND and the I2C SCL/SDA pins between the SCD-30 and the VEML7700.</p>

<p>But where does power for the VEML come from, you may ask?
Well, it’s a hack, but the <a href="https://learn.adafruit.com/assets/73775">VEML board schematic</a> shows that the 3.3V “output” pin is a direct connection to the 3.3V “plane” of that board (including Vin for the sensor itself), so in theory it’s safe to feed 3.3V from the SCD-30 <em>into</em> the VEML7700’s 3.3V “out” and ignore the voltage regulator on the VEML entirely.
With that also bridged with a header, all that was left was to connect the STEMMA cable from the SCD-30 to the ESP32 board and start writing code.</p>

<h2 id="firmware">Firmware</h2>

<p>After grabbing libraries and sample code from Adafruit for each of the sensors to make sure they were working, getting a basic “logger” working over serial output was trivial.
As I mentioned above, the ESP32 has WiFi built in though, so I decided to use that to connect to an InfluxDB instance (just on Influx’s free cloud plan right now) and log everything there.
Some more munging of sample code later, and I had basically the entire thing ready to go.</p>

<p>I taped it down on the side of a shelf (which should expose it only to indirect light, for more accurate measurements) and let it run for a night.</p>

<p><img src="/images/environmental_sensor.jpeg" alt="The monitor all put together" /></p>

<p>However, when I checked the data the next day I found the readings were pretty far off.</p>

<p>At this point it was still July so I had the windows open most of the day which meant both that the temperature should be almost identical to what’s measured outside, and that the CO2 concentration should be ~400ppm.
The temperature I was logging was nearly 2.5degC (~4.5degF) too high, and due to how the NDIR sensor works, that discrepancy was also affecting the CO2 reading.
It looks like I’m <a href="https://forum.arduino.cc/t/scd30-on-esp32-wrong-temp-hum-calibration-issue/679237">not the only one with this issue</a>, but either way the fix was quick - there’s a built-in temperature offset value that can be set from the ESP32.</p>

<p>Even after that was changed, the CO2 reading was still a bit off, so I made one last change to the firmware, adding a simple HTTP server. By hitting a specific route, I could remotely force a recalibration of the SCD-30 (basically, on recalibration the sensor assumes it’s measuring ambient outside air with a CO2 concentration of 400ppm and adjusts its internal offset as appropriate).</p>

<h2 id="conclusion">Conclusion</h2>

<p>The materials list and the full source for the firmware is available at <a href="https://github.com/kallsyms/environmental_sensor">https://github.com/kallsyms/environmental_sensor</a>.</p>]]></content><author><name>Nick Gregory</name></author><category term="electronics" /><summary type="html"><![CDATA[Early on in the pandemic, there was a good amount of discussion on Twitter about indoor CO2 levels as more people were spending time exclusively at home, often in a single, small room for hours on end. Since I was one of those people spending nearly the entire day in a single room, I decided to look around for a CO2 monitoring system. While a simple “alert after levels rise above x ppm” is sufficient, I was really looking for one that would be able to log data to a remote system so that I could monitor it throughout the day and/or look back on historical data from any computer. After being thoroughly disappointed with what was on Amazon (nothing at a reasonable price point seemed to be able to send data to a remote server), I decided it would be a nice little project to build my own.]]></summary></entry><entry><title type="html">Overkilling Website Performance</title><link href="https://www.nickgregory.me/post/2019/11/19/overkilling-website-performance/" rel="alternate" type="text/html" title="Overkilling Website Performance" /><published>2019-11-19T00:00:00+00:00</published><updated>2019-11-19T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2019/11/19/overkilling-website-performance</id><content type="html" xml:base="https://www.nickgregory.me/post/2019/11/19/overkilling-website-performance/"><![CDATA[<p>Given the <a href="https://status.cloud.google.com/incident/cloud-datastore/19006">recent</a> <a href="https://status.cloud.google.com/incident/cloud-networking/19020">series</a> <a href="https://status.cloud.google.com/incident/storage/19002">of</a> <a href="https://status.cloud.google.com/incident/cloud-networking/19009">issues</a> with Google 
Cloud, I decided it was time to jump ship and look at other providers for this blog (and eventually the rest of my sites most likely).</p>

<h2 id="background">Background</h2>

<p>For the past year or so, I’ve been using GCP multi-region storage buckets with CloudFlare in front (for caching and TLS) to serve all of my static sites.
I was never thrilled with the TTFB numbers I was getting out of the combo on un-cached pages however, and GCP having four relatively major outages in 6 months just pushed it over the edge for me.</p>

<p>I did some initial tests in AWS with a single S3 bucket and Cloudfront, and while results were <em>slightly</em> better, they still were not fantastic - times to load pages not in Cloudfront’s edge caches were around 60-70ms (vs ~100ms for GCS).
Keep in mind however that those numbers are from a client in NYC with the site hosted in us-east-1 - one of the best cases possible (latency-wise).
Visitors on the other side of the US would be right back to ~100ms load times, and visitors outside of the Americas would easily be 150ms+.</p>

<p>Not having much better to do one night, I decided to figure out how I could completely overkill site performance.
My main concern was performance when cache misses happened (the majority of page loads on my site due to it not getting much traffic),
and my objective was to get ~60ms response times on any page (cached or not) from anywhere in the world, not just within the US.</p>

<p>Sticking with Amazon as a provider here, there’s a few options I started exploring:</p>

<h3 id="s3--cloudfront-with-extremely-large-ttls">S3 + Cloudfront with extremely large TTLs</h3>

<p>Based on <a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/HowCloudFrontWorks.html#CloudFrontRegionaledgecaches">the docs</a>, Cloudfront employs a two-layer cache.
POPs have their own, independent caches (very standard), however Cloudfront also has regional caches which, based on maps, seem to correspond to AWS regions.
If an item is not in the POP’s cache, it will reach back to the regional cache, which can then return the object from there or go back to the origin if necessary.
Since my site is not visited very frequently, it’s highly unlikely that any given page will be cached in the POP closest to the visitor (even with high TTLs), so I would be relying on regional caches keeping content basically indefinitely.
In theory this layout <em>could</em> work (assuming regional caches effectively never expire items that are within their TTL), however this is not guaranteed, and it also means I would have to explicitly invalidate a number of pages each time I make a change to the site.</p>

<h3 id="multi-region-s3--">Multi-region S3 + ???</h3>

<p>This was actually my first thought: just stick a copy of the site on each continent.
While the content replication is easy to do within S3, there don’t seem to be any ways to make origin decisions in Cloudfront based on geolocation.
The closest thing I found was to use a Lambda@Edge function to dynamically proxy the request based on geo, but then I got to thinking…
If I already need to have a function at the edge to determine where to proxy incoming requests, could I just have the function return the site itself?</p>

<p>This reminded me of <a href="https://blog.cloudflare.com/workers-sites/">a blog post by CloudFlare</a> which talks about deploying a static site to their edge using their Workers product (storing the site in their K/V store).
I was curious to see if I could do something similar on Amazon, mainly because Workers has a $5/mo minimum price which I’d rather not pay if I can avoid it.</p>

<h3 id="lambdaedge">Lambda@Edge</h3>

<p>Amazon has a vaguely similar product to Workers called Lambda@Edge which, after a bit of reading, seems to have a bit of a misleading name (in my opinion).
From what I can tell (based on docs and timing), the Lambda functions (at least for “Origin Request” triggered calls) are invoked in the nearest Amazon region, <em>not</em> at the POP/edge itself.
Either way, if I can easily get the site contents stored in every Amazon region, that definitely gets me very close to the goal of delivering uncached pages in ~60ms anywhere on Earth.</p>

<p>A bonus of Lambdas that I only realized later is that their timing characteristics end up coinciding nicely with visitor usage.
If a visitor comes along and is the first one in a while in the entire AWS region, it will take a hundred milliseconds or so for the Lambda function to start up, which, while not ideal, also isn’t the worst thing since DNS resolution, the initial TCP handshake, TLS, etc. will have also taken up a bit of time.
<em>Keep in mind this all only happens in the case that Cloudfront doesn’t have the page cached, either in the POP or in the regional cache, which certain pages (the home page, for instance) likely will be just due to background traffic.</em>
The interesting part about Lambdas is what happens on subsequent requests. Requests to other pages (like the visitor clicking on a blog entry) are unlikely to be cached (since Cloudfront didn’t have the prior page cached either), but we now have a warm Lambda instance that can serve requests in a couple of milliseconds.
So regardless of where the user is, they will either hit a cached page in Cloudfront (taking basically round-trip time to respond), or will be proxied to a warmed Lambda instance through Cloudfront.</p>
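<p>To make the serve-from-the-Lambda idea concrete, here’s a minimal sketch (page contents are placeholders): the site lives in a module-level dict loaded once per Lambda instance at cold start, so warm invocations answer straight from memory, and returning a response object instead of the request short-circuits any origin fetch.</p>

```python
# Hypothetical sketch of an "Origin Request" Lambda@Edge handler that
# generates the response itself instead of proxying to an origin.
# SITE would be bundled with the function; the contents here are placeholders.
SITE = {
    "/": ("<html>home</html>", "text/html"),
    "/post/example/": ("<html>post</html>", "text/html"),
}

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    page = SITE.get(request["uri"])
    if page is None:
        return {"status": "404", "statusDescription": "Not Found"}
    body, content_type = page
    # Returning a response (rather than the request) tells CloudFront to
    # skip the origin and cache this according to Cache-Control.
    return {
        "status": "200",
        "statusDescription": "OK",
        "headers": {
            "content-type": [{"key": "Content-Type", "value": content_type}],
            "cache-control": [{"key": "Cache-Control", "value": "public, max-age=86400"}],
        },
        "body": body,
    }
```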

<h2 id="comparing-performance">Comparing Performance</h2>

<h3 id="between-solutions">Between Solutions</h3>
<p>While I only ended up implementing the full Lambda@Edge solution (and so don’t have concrete numbers for the others), we can make some deductions about relative performance:</p>

<ul>
  <li>vs. multi-region S3
    <ul>
      <li>Even with a bucket in <strong>every</strong> region, the Lambda function would still have to do an intra-region request/response to fetch the content from S3</li>
      <li>If there was only one bucket per continent, there would be additional inter-region latency</li>
    </ul>
  </li>
  <li>vs. large TTLs
    <ul>
      <li>Strictly better in the case of a complete cache miss (no round trip to the origin)</li>
      <li>Only worse on first page load in a region with no running Lambda</li>
    </ul>
  </li>
</ul>

<p>There’s also a few non-performance benefits to a pure Lambda@Edge solution (vs. large TTLs):</p>
<ul>
  <li>Don’t have to deal with invalidation on every site update</li>
  <li>Eliminates the single point of failure of the origin (just in case <a href="https://aws.amazon.com/message/41926/">https://aws.amazon.com/message/41926/</a> happens again)</li>
</ul>

<h3 id="old-vs-new">Old vs. New</h3>

<p>With all of that theory discussed, let’s look at some actual measurements (taken with <a href="https://pulse.turbobytes.com/">TurboBytes Pulse</a>):</p>

<p>The old method (GCS + CloudFlare) gave mean TTFBs of <strong>~350ms</strong> on effectively every page unless it happened to be in CloudFlare’s cache.</p>

<p>The new method gives global average response times of <strong>~210ms</strong> for the first connection in the region, and subsequent loads <em>of uncached pages</em> in <strong>~70ms</strong>.</p>

<p>TTFBs of pages in POP caches average <strong>~30ms</strong> on both.</p>

<h2 id="conclusion">Conclusion</h2>

<p>With a bit more effort it should be possible to keep a Lambda instance warm in each region, which should completely eliminate the first page TTFB penalty, giving consistent uncached TTFBs of ~70ms.
And with that, I’ve basically achieved my original goal (averaging just 10ms higher than I hoped for), significantly bringing down page load times and making the blog extremely snappy.</p>

<p>Will many people notice? No. But was it fun? Absolutely.</p>]]></content><author><name>Nick Gregory</name></author><category term="sysadmin" /><summary type="html"><![CDATA[Given the recent series of issues with Google Cloud, I decided it was time to jump ship and look at other providers for this blog (and eventually the rest of my sites most likely).]]></summary></entry></feed>