Warpspeed: Building a Record/Replay Debugger for macOS

This is a (long) overdue post to accompany the REcon 2023 talk that Pete Markowsky and I gave about our work on Warpspeed: a time travel debugger for macOS.

What is time travel debugging?

Time travel debuggers (also called record/replay debuggers) add an extra “dimension” to normal debuggers - the ability to go back in time in a debugging session. This makes it significantly easier to determine causality - what first triggered the bug in question. In some cases, this may be very obvious (e.g. an inverted logic condition), but in others the underlying trigger may have happened long ago, in a different context, or even on a different thread entirely.

The original motivation to create Warpspeed came from the pain of trying to debug a very rare crash in Santa. At seemingly random times, one of Santa’s non-critical processes would end up in what seemed like a physically impossible state. It took weeks of looking at this code on and off until we eventually figured out what was going on: an Objective-C callback was writing into a stack-local boolean which might no longer be valid by the time the callback ran. In most cases, the bug didn’t trigger (since the write only happened on an error path), and even if the error was hit, the *bool = true write might have been completely innocuous, flipping unused or unimportant bits depending on what other threads were running. But if a thread was running just the right thing when this callback was processed, the bool would be written into the stack of a now different thread, corrupting it and causing the crashes we were seeing. This is the type of bug that a time travel debugger makes significantly easier to track down.

Prior Work

There have been time travel debuggers for decades at this point - it’s not a new idea. If you’re a Windows developer, you may have even used one before - WinDbg has had time travel debugging since 2017. On Linux, rr (which came out of Mozilla) is probably the most well known, but there are a handful more, including DetTrace, more recently Hermit, and even GDB, though GDB’s built-in TTD has been… lacking… in my experience.

Windows/WinDbg has first-party magic involved to make it work (I’m actually not sure about the details of this ¯\_(ツ)_/¯), but the primary mechanism for all of the Linux debuggers above is the ptrace subsystem, which provides easy syscall interception plus trapping of non-deterministic CPU instructions (e.g. rdtsc). This, combined with the ability to set thread affinity (preventing threads from running and/or forcing serialization) and hardware performance counters (to measure and reproduce events coming into the program from the outside world), makes it possible to build record/replay debuggers on Linux.

So what about macOS?

Why would we do this to ourselves?

macOS has a documented history of having lackluster ptrace. And by this I mean an almost non-existent implementation. It also doesn’t have any way to limit processes to specific cores, has limited PMC facilities, etc. Not off to a great start.

macOS (and the BSDs in general) does have dtrace, however. We originally experimented with using dtrace to hook syscalls and traps and got pretty far with this, but eventually realized it wasn’t going to be the final answer for a couple of reasons. We either had to have the dtrace program send a POSIX signal to the target (using raise()) and intercept that (using waitpid or similar), or use another dtrace call to suspend the mach task (using stop()) and poll for that change in another process, since there’s no way to receive notifications of a mach suspend happening. Both of these methods also only freeze the program after the syscall has run, meaning any state needed from before the syscall would have to be collected by the dtrace program itself. Lastly, dtrace didn’t provide a way to force serialization (controlling thread preemption), so we were stuck.

We then played around with the idea of pure userland interception. The main realization here is that the macOS ABI isn’t at the syscall layer like it is on Linux, but rather at the libSystem layer - a dylib just like any other. Hooks could be added to libSystem_kernel to intercept everything in userland (letting us gather any program state we could possibly need), but we still had issues with things like threading. We could intercept the various XNU thread creation calls and implement our own scheduler, but we still wouldn’t have visibility into early process start, leaving some things non-deterministic before we got a chance to do anything (e.g. malloc entropy, pointer munge value, etc.).

Eventually after trying to figure out how else we could limit concurrency, we had another thought: what if we put the userland application into a VM? By definition the guest would have to trap out to receive any data which could be non-deterministic. This lets us put the entire app “in a box” and use normal VMM facilities to intercept and log events from the app. This also means we can completely control scheduling by just swapping vCPU register state between threads as is done normally by the kernel.
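To make the scheduling idea concrete, here is a minimal sketch of multiplexing guest threads onto a single vCPU by saving and restoring register state. Everything in it (Registers, GuestThread, Scheduler, the four-register file, the round-robin policy) is made up for illustration and is not taken from Warpspeed's code; a real VMM would read and write the registers through the hypervisor API rather than a plain struct.

```rust
// Hypothetical sketch: none of these names come from Warpspeed or AppBox.

/// Stand-in for the register file a VMM would save and restore per thread.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Registers {
    pub pc: u64,
    pub sp: u64,
    pub gprs: [u64; 4], // real hardware has many more registers
}

/// One guest thread: an ID plus its saved registers while descheduled.
#[derive(Clone, Debug)]
pub struct GuestThread {
    pub id: u32,
    pub saved: Registers,
}

/// A single vCPU shared by every guest thread, so only one runs at a time.
pub struct Scheduler {
    pub vcpu: Registers,          // state currently loaded on the vCPU
    pub current: usize,           // index of the running thread
    pub threads: Vec<GuestThread>,
    pub switch_log: Vec<u32>,     // order of thread IDs, usable for replay
}

impl Scheduler {
    pub fn new(threads: Vec<GuestThread>) -> Self {
        assert!(!threads.is_empty(), "need at least one thread");
        let vcpu = threads[0].saved.clone();
        let first = threads[0].id;
        Scheduler { vcpu, current: 0, threads, switch_log: vec![first] }
    }

    /// Save the live registers into the outgoing thread, then load the
    /// incoming thread's saved registers onto the vCPU.
    pub fn switch_to(&mut self, next: usize) {
        self.threads[self.current].saved = self.vcpu.clone();
        self.vcpu = self.threads[next].saved.clone();
        self.current = next;
        self.switch_log.push(self.threads[next].id);
    }

    /// Deterministic round-robin: always pick the next thread in order.
    pub fn yield_next(&mut self) {
        let next = (self.current + 1) % self.threads.len();
        self.switch_to(next);
    }
}
```

Because every switch goes through switch_to, the order of thread IDs in switch_log is the complete scheduling history, which is the property a replayer would need.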

Late Night(s) Hacking

At first glance this might appear to be a pretty simple thing to do - you just need to load the Mach-O into the VM’s physical address space and set up a trap to exit to the hypervisor when a syscall is performed, right? We even had a lot of reference code from the Darling Project implementing the loader, commpage, etc! Well…

I’m sure I’m missing stuff recounting this many months after the fact, but the point is: it was not easy.

After a couple weeks, I had gone through the above and had things mostly working. Now just to add recording of state before syscalls, after syscalls, persisting that to disk, and injecting it back in during a replay session…

In a time crunch (with only another few weeks until we were due to speak), we had another idea to keep implementation time down: instead of implementing semantics for each syscall and mach trap that macOS has, we can just “diff” memory before and after the host kernel handles a syscall and use that to automate most of the state recording. Since programs really only have 2 ways to get data (reading out of memory or performing a syscall), we only have to special case the things that can modify what memory is valid (e.g. mmap) - all memory reads/writes will be deterministic, and changes to the memory from syscalls are faked during replay. Since Warpspeed must already special case syscalls which modify the memory maps (to map/unmap in the VM), it knows which addresses are mapped and can recursively check whether any register or memory value is a valid pointer - a similar technique to how some garbage collectors work. For any addresses discovered, we snapshot the surrounding memory before the syscall and record the diff against the new memory after the syscall has been processed.
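A rough sketch of that record/replay cycle might look like the following. The Region and Patch types and the three functions are hypothetical names chosen for illustration, and the on-disk format, the recursion through discovered pointers, and the choice of how much "surrounding memory" to snapshot are all left out; this only shows the conservative pointer test, the diff, and the patch application.

```rust
// Hypothetical sketch; names and layout are illustrative, not Warpspeed's.

/// A mapped guest region: base address plus a copy of its bytes.
pub struct Region {
    pub base: u64,
    pub bytes: Vec<u8>,
}

/// One contiguous run of bytes that changed across a syscall.
#[derive(Debug, Clone, PartialEq)]
pub struct Patch {
    pub addr: u64,
    pub data: Vec<u8>,
}

/// Conservative pointer test (as in a conservative GC): treat `value` as a
/// pointer if it lands inside any mapped region.
pub fn looks_like_pointer(value: u64, regions: &[Region]) -> bool {
    regions
        .iter()
        .any(|r| value >= r.base && value - r.base < r.bytes.len() as u64)
}

/// Record: compare snapshots taken before and after the syscall and
/// coalesce adjacent changed bytes into patches.
pub fn diff(base: u64, before: &[u8], after: &[u8]) -> Vec<Patch> {
    assert_eq!(before.len(), after.len(), "snapshots must be the same size");
    let mut patches: Vec<Patch> = Vec::new();
    let mut open = false; // true while the last patch is still being extended
    for (i, (b, a)) in before.iter().zip(after).enumerate() {
        if a != b {
            if open {
                patches.last_mut().unwrap().data.push(*a);
            } else {
                patches.push(Patch { addr: base + i as u64, data: vec![*a] });
                open = true;
            }
        } else {
            open = false;
        }
    }
    patches
}

/// Replay: instead of running the syscall, write the recorded bytes back.
pub fn apply(region: &mut Region, patches: &[Patch]) {
    for p in patches {
        let start = (p.addr - region.base) as usize;
        region.bytes[start..start + p.data.len()].copy_from_slice(&p.data);
    }
}
```

One consequence of diffing rather than modelling each syscall is that bytes the kernel rewrote with identical values are never recorded, which is fine for replay since the guest cannot observe the difference.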

With less than a day to spare, still writing code at the airport about to fly up to Montreal for the conference, this all finally came together for the demo: https://www.youtube.com/watch?v=Td5cQ6kGP5g.

Epilogue

After the talk, I refactored and cleaned up the VM management code and split it into a separate crate called AppBox. It contains the main logic for “putting apps in a box,” allowing for a number of other tools to be created on top. For example, one could implement something like gVisor for macOS using this library, limiting access to, or emulating bits of XNU. This division makes the entire Warpspeed record/replay system just an AppBox trap handler, responsible for the aforementioned pointer chasing, discovering, and recording of changed memory during syscalls.
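As an illustration of that split, the sketch below shows how record and replay could be two implementations of one trap-handler interface. This is not AppBox's actual API: the TrapHandler trait, the Trap struct, and the Recorder and Replayer types are all invented here, and a closure stands in for "let the host kernel handle the syscall." It also logs only a return value, where the real system records memory diffs.

```rust
// Hypothetical interface; AppBox's real API may look quite different.
use std::collections::VecDeque;

/// What the box reports when the guest traps out for a syscall.
pub struct Trap {
    pub number: u64,
    pub arg: u64,
}

/// A handler decides what result the guest sees for each trap.
pub trait TrapHandler {
    fn handle(&mut self, trap: &Trap) -> u64;
}

/// Record mode: forward the trap to the host and log the result.
pub struct Recorder<F: FnMut(&Trap) -> u64> {
    pub host: F,
    pub log: Vec<(u64, u64)>, // (syscall number, result)
}

impl<F: FnMut(&Trap) -> u64> TrapHandler for Recorder<F> {
    fn handle(&mut self, trap: &Trap) -> u64 {
        let result = (self.host)(trap);
        self.log.push((trap.number, result));
        result
    }
}

/// Replay mode: never touch the host; hand back logged results in order.
pub struct Replayer {
    pub log: VecDeque<(u64, u64)>,
}

impl TrapHandler for Replayer {
    fn handle(&mut self, trap: &Trap) -> u64 {
        let (number, result) = self.log.pop_front().expect("trace exhausted");
        assert_eq!(number, trap.number, "replay diverged from recording");
        result
    }
}
```

The same interface would also fit the sandboxing use case: a handler that denies or emulates certain traps instead of forwarding them.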

If you want to play around with Warpspeed, clone the repo and run make to build and sign the CLI. You’ll need Rust nightly installed. Then run:

./target/release/warpspeed record -vvvv /tmp/trace /bin/ls -l

to record an execution of /bin/ls -l to the trace file /tmp/trace. Finally, replay it with

./target/release/warpspeed replay -vvvv /tmp/trace

If everything worked, you should be able to remove files, change directories, or do basically anything else and still see the same output on replay.

The Future

There’s still plenty to do with Warpspeed - by no means is it production ready or even super useful. Replay of thread switching and external signals still needs to be implemented (along with all of the infrastructure to measure and break when some number of instructions has been run), and there are unanswered questions about how this should interact with the graphical components of macOS. However, there should not be any technical limitations making this impossible; it is only a matter of (substantial) development time, of which we have little right now 🙂