<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.7">Jekyll</generator><link href="https://www.nickgregory.me/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.nickgregory.me/" rel="alternate" type="text/html" /><updated>2025-03-23T16:14:20+00:00</updated><id>https://www.nickgregory.me/feed.xml</id><title type="html">Nick Gregory</title><author><name>Nick Gregory</name><email>nick@nickgregory.me</email></author><entry><title type="html">Bismuth: Building a Cloud</title><link href="https://www.nickgregory.me/post/2024/06/29/bismuth/" rel="alternate" type="text/html" title="Bismuth: Building a Cloud" /><published>2024-06-29T00:00:00+00:00</published><updated>2024-06-29T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2024/06/29/bismuth</id><content type="html" xml:base="https://www.nickgregory.me/post/2024/06/29/bismuth/"><![CDATA[<p>This year, a <a href="https://x.com/kinglycrow">good friend</a> and I have been working on a new startup: <a href="https://www.bismuth.cloud/">Bismuth</a>.</p>

<p>We’re building a cloud - not like AWS or GCP, which have hundreds of services to pick from, but an opinionated platform with the essentials built in, designed to make it fast and easy to go from 0 to 100.
There’s no setting up a secrets manager, no configuring CloudTrail, not even creating S3 buckets. Just one click to deploy your code (potentially written by our LLM), and you have implicit access to configuration &amp; secrets, a K/V service, and blob storage, with a couple more services coming soon.</p>

<p>I know I’ve had many times where I just want to put some code up somewhere and have it run, but every time I either ran into the aforementioned cloud “feature” of needing to set up four different services to serve “Hello World” in a Lambda, or had to spend hours wrangling CI/CD to get Docker images built, stored somewhere, and finally actually running. And that’s all before we get into needing fancy things like <em>databases</em> :)</p>

<p>Please check out the <a href="https://www.bismuth.cloud/">website</a> and the <a href="https://www.bismuth.cloud/blog">blog</a> (where we have some really good technical articles planned about the process of building Bismuth and its internals), or <a href="https://app.bismuth.cloud/">give the platform a go</a>. Let us know what you think!</p>

<h1 id="what-is-time-travel-debugging">What is time travel debugging?</h1>

<p>Time travel debuggers (also called record/replay debuggers) add an extra “dimension” to normal debuggers - the ability to go back in time in a debugging session. This makes it significantly easier to determine <em>causality</em> - what first triggered the bug in question. In some cases, this may be very obvious (e.g. an inverted logic condition), but in others the underlying trigger may have happened long ago, in a different context, or even on a different thread entirely.</p>

<p>The original motivation to create Warpspeed came from the pain of trying to debug a very rare crash in <a href="https://github.com/google/santa">Santa</a>. At seemingly random times, one of Santa’s non-critical processes would end up in what seemed like a physically impossible state. It took <em>weeks</em> of looking at this code on and off until we eventually figured out what was going on: an Objective-C callback was writing into a stack-local boolean whose frame might no longer be valid by the time the callback ran. In most cases, the bug didn’t trigger (since the write only happened on an error path), and even when the error path was hit, the <code class="highlighter-rouge">*bool = true</code> may have been completely innocuous, flipping unused or unimportant bits depending on what other threads were running. But if a thread was running just the right thing when this callback was processed, the bool would be written into the stack of a now-different thread, corrupting it and causing the crashes we were seeing. These are exactly the types of bugs a time travel debugger makes dramatically easier to track down.</p>

<h2 id="prior-work">Prior Work</h2>

<p>There have been time travel debuggers for decades at this point - it’s not a new idea. If you’re a Windows developer, you may have even used one before - WinDbg has had time travel debugging since 2017. On Linux, <a href="https://rr-project.org/">rr</a> (which stemmed out of Mozilla) is probably the most well known, but there’s a handful more including <a href="https://github.com/dettrace/dettrace">DetTrace</a>, more recently <a href="https://github.com/facebookexperimental/hermit">Hermit</a>, and even GDB, though GDB’s built-in TTD has been… lacking… in my experience.</p>

<p>Windows/WinDbg has first-party magic involved to make it work (I’m actually not sure about the details of this ¯\_(ツ)_/¯), but the primary mechanism for all of the Linux debuggers above is the <code class="highlighter-rouge">ptrace</code> subsystem, which provides easy syscall interception plus trapping of non-deterministic CPU instructions (e.g. <code class="highlighter-rouge">rdtsc</code>). This, combined with the ability to set thread affinity (preventing threads from running and/or forcing serialization) and hardware performance counters (to measure and reproduce events coming into the program from the outside world), makes it possible to build record/replay debuggers on Linux.</p>

<p>So what about macOS?</p>

<h1 id="why-would-we-do-this-to-ourselves">Why would we do this to ourselves?</h1>

<p>macOS has a <a href="http://uninformed.org/index.cgi?v=4&amp;a=3&amp;p=14">documented history</a> of lackluster <code class="highlighter-rouge">ptrace</code> support - and by that I mean an almost non-existent implementation. It also doesn’t have any way to pin processes to specific cores, has limited PMC facilities, etc. Not off to a great start.</p>

<p>macOS (and the BSDs in general) <em>do</em> have <code class="highlighter-rouge">dtrace</code> however. We originally experimented with using <code class="highlighter-rouge">dtrace</code> to hook syscalls and traps, and got pretty far with this, but eventually realized it wasn’t going to be the final answer for a couple of reasons. We either had to have the dtrace program send a POSIX signal to the target (using <code class="highlighter-rouge">raise()</code>) and intercept that (using <code class="highlighter-rouge">waitpid</code> or similar), or use another dtrace call to suspend the mach task (using <code class="highlighter-rouge">stop()</code>) and poll for that change from another process, since there’s no way to be notified of a mach suspend happening. Both of these methods also only freeze the program <em>after</em> the syscall has run, meaning any collection of pre-syscall memory state would need to be done by the dtrace program itself. Lastly, dtrace didn’t provide a way to force serialization (controlling thread preemption), so we were stuck.</p>

<p>We then played around with the idea of pure userland interception. The main realization here is that the macOS ABI isn’t at the syscall layer like it is on Linux, but rather at the <code class="highlighter-rouge">libSystem</code> layer - a dylib just like any other. Hooks could be added to <code class="highlighter-rouge">libSystem_kernel</code> to intercept everything in userland (letting us gather any program state we could possibly need), but we still had issues with things like threading. We could intercept the various XNU thread creation calls and implement our own scheduler, but we still wouldn’t have visibility into early process start, making some things non-deterministic before we got a chance to do anything (e.g. malloc entropy, the pointer munge value, etc.)</p>

<p>Eventually after trying to figure out how else we could limit concurrency, we had another thought: what if we put the userland application into a VM? By definition the guest would <em>have</em> to trap out to receive any data which could be non-deterministic. This lets us put the entire app “in a box” and use normal VMM facilities to intercept and log events from the app. This also means we can completely control scheduling by just swapping vCPU register state between threads as is done normally by the kernel.</p>

<h1 id="late-nights-hacking">Late Night(s) Hacking</h1>

<p>At first glance this might appear to be a pretty simple thing to do - you just need to load the Mach-O into the VM’s physical address space and set up a trap to exit to the hypervisor when a syscall is performed, right? We even had a lot of reference code from the <a href="https://www.darlinghq.org/">Darling Project</a> implementing the loader, commpage, etc.! Well…</p>

<ul>
  <li>We also need to load the dyld shared cache
    <ul>
      <li>Can we just map the running cache into the VM?
        <ul>
          <li>No - it’s not allowed by the kernel</li>
        </ul>
      </li>
      <li>Can we copy the running cache?
        <ul>
          <li>No, because some globals are already initialized, which confuses/breaks dyld’s init</li>
        </ul>
      </li>
      <li>Can we use <code class="highlighter-rouge">DYLD_SHARED_CACHE_DIR</code> to have the guest dyld try to map a new cache itself?
        <ul>
          <li>No, because it basically just asks the kernel, which says there’s already one in memory</li>
        </ul>
      </li>
      <li>Can we unpack the cache and ask dyld to load individual dylibs?
        <ul>
          <li>Yes, but it doesn’t work because of dylib fixups</li>
        </ul>
      </li>
      <li>Time to go implement a DSC loader…</li>
    </ul>
  </li>
  <li>… and we fault on some memory write in dyld (on an atomic, I believe)
    <ul>
      <li>The ARM memory model requires virtual memory/page tables to be set up so the CPU knows how to treat the memory (i.e. (non)-gathering, (non)-reorderable, (non)-early write ack).</li>
      <li>Great, now go implement page tables</li>
      <li>Debugged an issue for 2 days due to a missing bit causing faults</li>
      <li>Eventually switched to using a lightly modified version of <a href="https://github.com/Impalabs/hyperpom">Hyperpom</a> which dealt with the page table management
        <ul>
          <li>And in the process rewrote the rest of the loading code in Rust (instead of the C it was in before)</li>
        </ul>
      </li>
    </ul>
  </li>
  <li>And finally, we need to set up a few pages for thread-local storage</li>
</ul>

<p>I’m sure I’m missing stuff recounting this many months after the fact, but the point is: <em>it was not easy</em>.</p>

<p>After a couple weeks, I had gone through the above and had things mostly working. Now just to add recording of state before syscalls, after syscalls, persisting that to disk, and injecting it back in during a replay session…</p>

<p>In a time crunch (with only a few weeks until we were due to speak), we had another idea to keep implementation time down: instead of implementing semantics for each syscall and mach trap that macOS has, we can just “diff” memory before and after the host kernel handles a syscall and use that to automate most of the state recording. Since programs really only have two ways to get data (reading it out of memory or performing a syscall), and all memory reads/writes inside the VM are deterministic, we only have to special-case the things that can modify what memory is valid (e.g. <code class="highlighter-rouge">mmap</code>) - changes made to memory by syscalls are recorded and faked during replay. Since Warpspeed must already special-case syscalls which modify the memory maps (to map/unmap in the VM), it can recursively check whether any register or memory value contains a valid pointer - a similar technique to how some garbage collectors work. For any addresses discovered, we snapshot the surrounding memory before the syscall and record the diff against the memory after the syscall was processed.</p>

<p>With less than a day to spare, still writing code at the airport about to fly up to Montreal for the conference, this all finally came together for the demo: <a href="https://www.youtube.com/watch?v=Td5cQ6kGP5g">https://www.youtube.com/watch?v=Td5cQ6kGP5g</a>.</p>

<h1 id="epilogue">Epilogue</h1>

<p>After the talk, I refactored and cleaned up the VM management code and split it into a separate crate called <a href="https://github.com/kallsyms/appbox">AppBox</a>. It contains the main logic for “putting apps in a box,” allowing a number of other tools to be built on top. For example, one could implement something like <a href="https://gvisor.dev/">gVisor</a> for macOS using this library, limiting access to, or emulating, bits of XNU. This division makes the entire Warpspeed record/replay system just an AppBox trap handler, responsible for the aforementioned pointer chasing, discovery, and recording of changed memory during syscalls.</p>

<p>If you want to play around with Warpspeed, clone <a href="https://github.com/kallsyms/warpspeed">the repo</a> and run <code class="highlighter-rouge">make</code> to build and sign the CLI. You’ll need Rust nightly installed. Then:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./target/release/warpspeed record -vvvv /tmp/trace /bin/ls -l
</code></pre></div></div>

<p>to record an execution of <code class="highlighter-rouge">/bin/ls -l</code> to the trace file <code class="highlighter-rouge">/tmp/trace</code>. Finally, replay it with</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./target/release/warpspeed replay -vvvv /tmp/trace
</code></pre></div></div>

<p>And if all worked, you should be able to remove files, change directories, or do basically anything and still see the same output on replay.</p>

<h2 id="the-future">The Future</h2>

<p>There’s still plenty to do with Warpspeed - by no means is it production ready or even super useful. Replay of thread switching and external signals still needs to be implemented (along with all of the infrastructure to measure and break when some number of instructions has been run) and there are unanswered questions about how this should interact with the graphical components of macOS. However there should not be any technical limitations making this impossible, only a matter of (substantial) development time. Of which we have little right now 🙂</p>]]></content><author><name>Nick Gregory, Pete Markowsky</name></author><category term="security" /><category term="macos" /><summary type="html"><![CDATA[This is a (long) overdue post to accompany the REcon 2023 talk Pete Markowsky and I gave talking about our work on Warpspeed: a time travel debugger for macOS.]]></summary></entry><entry><title type="html">Improving Fuzzing Speed with userfaultfd</title><link href="https://www.nickgregory.me/post/2022/12/09/uffd-fuzz/" rel="alternate" type="text/html" title="Improving Fuzzing Speed with userfaultfd" /><published>2022-12-09T00:00:00+00:00</published><updated>2022-12-09T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2022/12/09/uffd-fuzz</id><content type="html" xml:base="https://www.nickgregory.me/post/2022/12/09/uffd-fuzz/"><![CDATA[<p>About the same time I wrote up my <a href="https://nickgregory.me/post/2021/12/10/afl-kmod/">previous post about snapshot fuzzing</a>, I was thinking about other ways to restore program state for fuzzing, ideally in userland for ease of use.</p>

<p>There are of course many program side effects that need to be accounted for to restore program state perfectly: threads, files, timers, etc.
However, those all interact with the kernel in ways that can be intercepted with either libc hooks or syscall (seccomp) hacks.
Better yet, for the purposes of fuzzing they can often be disregarded and cleaned up in bulk after some large number of runs - for example, a few extra open files shouldn’t break well-written programs.</p>

<p>The biggest challenge is restoring memory state since there’s no easy way to determine what memory has changed between runs from userland.
The kernel can do this without too much effort (see the previous post), but this information isn’t easily accessible to userland.
You could duplicate and restore <em>all</em> memory regions; however, this doubles the running memory overhead of any fuzz target, since it has to keep a pristine copy as well as the working copy. It may also take a significant amount of time to reset between runs if there are, say, hundreds of megabytes of shared libraries to restore.</p>

<p>While looking at something entirely unrelated, I had the idea to use <a href="https://man7.org/linux/man-pages/man2/userfaultfd.2.html"><code class="highlighter-rouge">userfaultfd</code></a> to do this memory dirtiness tracking, which could then be restored at page granularity after the program finished running.</p>

<h1 id="userfaultfd">userfaultfd?</h1>

<p>In short, <a href="https://man7.org/linux/man-pages/man2/userfaultfd.2.html"><code class="highlighter-rouge">userfaultfd</code></a> is a newer Linux-specific interface for <em>user</em>land page<em>fault</em> handlers.
Instead of having a single SIGSEGV handler and tweaking memory protections, <code class="highlighter-rouge">userfaultfd</code> allows memory to be registered to a <code class="highlighter-rouge">userfaultfd</code> object which is then polled by another thread (or even another process) to respond to those faults.
It’s a much more flexible and performant way of handling page faults compared to a signal handler, and is perfect for our needs (except for a few small hindrances which can be worked around).</p>

<h1 id="implementation">Implementation</h1>

<p>To test the viability of this, I first created a minimal proof of concept which:</p>

<ol>
  <li>Duplicates all program memory (besides the relocation stub itself) into anonymous pages since <code class="highlighter-rouge">userfaultfd</code> cannot hook file-backed pages.</li>
  <li>Registers each writeable (now anonymous) page with a <code class="highlighter-rouge">userfaultfd</code> object in <a href="https://man7.org/linux/man-pages/man2/userfaultfd.2.html#:~:text=an%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20UFFDIO_ZEROPAGE%20ioctl.-,UFFDIO_REGISTER_MODE_WP,-(since%205.7)%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20When">write-protect mode</a>. When one of these pages is written to, its address and contents are added to a simple statically allocated array for later restoration and its write protect bit is disabled.</li>
  <li>Calls a target function/program.</li>
  <li>Restores memory by iterating over the array, copying out the previously saved “pristine” content.</li>
  <li>GOTO 3</li>
</ol>

<blockquote>
  <p>I’m glossing over some details here (e.g. having to switch stacks when entering the snapshot/restore code so that the program doesn’t fault on its own stack and hang), but that’s the high level overview.</p>
</blockquote>

<h2 id="benefits">Benefits</h2>

<p>This approach has many nice benefits; perhaps the most significant is one I didn’t realize until later.
Since the dirty page list is kept between runs and the write-protect bit is cleared after the first write to a page, the overhead of intercepting memory writes goes down as more iterations run: commonly written pages are already in the restore list and aren’t hooked in subsequent runs.
This means fewer kernel context switches and more time spent actually running the target code.</p>

<p>After a few days of hacking on this, I got it working, targeting a simple program which did a <code class="highlighter-rouge">malloc</code> and printed out the returned address to show that heap restoration worked.
It also benchmarked quite nicely, with restoration taking under 2 microseconds - encouraging me on.</p>

<p>This proof-of-concept code is available <a href="https://github.com/kallsyms/uffd-fuzz">here</a>.</p>

<h1 id="a-real-benchmark">A Real Benchmark</h1>

<p>As seems to be fuzzing tradition, I decided to ensure this worked on “real” programs by wrapping <code class="highlighter-rouge">libjpeg-turbo</code>. Specifically, I targeted <code class="highlighter-rouge">djpeg</code> converting <a href="https://upload.wikimedia.org/wikipedia/commons/5/56/Tux.jpg">an image of Tux</a> into decompressed form and printing out the output.</p>

<p>In addition to the <code class="highlighter-rouge">userfaultfd</code> proof of concept, I also wrote up samples of a few other common fuzzing setups to compare against:</p>

<ol>
  <li>A simple fork server which just calls <code class="highlighter-rouge">fork</code> in a loop and then exec’s the target</li>
  <li>The same as above, but using <a href="https://man7.org/linux/man-pages/man2/vfork.2.html"><code class="highlighter-rouge">vfork</code></a></li>
  <li>An “improved” fork server which is inline in the target program, allowing initialization to happen once and forking just before <code class="highlighter-rouge">main</code> is called - you may know this as persistent mode in AFL</li>
</ol>

<p>Code for the <code class="highlighter-rouge">userfaultfd</code> and persistent mode versions is available in the branches of <a href="https://github.com/kallsyms/uffd-fuzz-libjpeg-turbo">this repo</a>.</p>

<h2 id="results">Results</h2>

<p>For 10,000 iterations of <code class="highlighter-rouge">djpeg /tmp/tux.jpg</code>:</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Median (ns)</th>
      <th>Min (ns)</th>
      <th>Max (ns)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>fork</td>
      <td>475737</td>
      <td>457088</td>
      <td>898557</td>
    </tr>
    <tr>
      <td>vfork</td>
      <td>442299</td>
      <td>427317</td>
      <td>815917</td>
    </tr>
    <tr>
      <td>persistent</td>
      <td>321325</td>
      <td>311980</td>
      <td>1610773</td>
    </tr>
    <tr>
      <td><strong>userfaultfd</strong></td>
      <td><strong>174630</strong></td>
      <td><strong>166653</strong></td>
      <td><strong>734212</strong></td>
    </tr>
  </tbody>
</table>

<p>As these results show, even on a more complex program the <code class="highlighter-rouge">userfaultfd</code> technique resulted in a ~1.8x median performance increase over persistent mode, validating the idea!</p>

<h1 id="limitations">Limitations</h1>
<p>As with everything there are some limitations to the proof of concept:</p>

<p>Most notably, the write-protect <code class="highlighter-rouge">userfaultfd</code> mode is currently implemented only for x86_64, meaning this approach will not work at all on ARM systems.
I don’t believe there’s any technical reason for this, however, so support could be added in the future.
Additionally, this technique <em>could</em> be implemented by clearing <code class="highlighter-rouge">PROT_WRITE</code> on every page and installing a normal SIGSEGV handler; however, this is notably slower (and much more annoying to implement) than <code class="highlighter-rouge">userfaultfd</code>.</p>

<p>Second, the fuzzing framework would need to intercept <code class="highlighter-rouge">mmap</code> (or really any syscall which could alter memory mappings) and “do the right thing.”
<code class="highlighter-rouge">mprotect</code> also needs to be hooked, and any pages being marked <code class="highlighter-rouge">PROT_WRITE</code> would need to be added to the <code class="highlighter-rouge">userfaultfd</code> before the <code class="highlighter-rouge">mprotect</code> returns.
None of this is done in the proof of concept, but it wouldn’t be too hard to add.</p>

<h1 id="conclusion">Conclusion</h1>

<p>While there is still more to be done to create a full implementation, this proof of concept shows that the strategy of using <code class="highlighter-rouge">userfaultfd</code> to reset program memory is viable, and even works as-is on moderately complex software.
Being fully in userland, it should be possible to adopt this technique into source-available fuzzers like AFL(++) with relatively little maintenance work (compared to custom kernel modifications).
Unfortunately I don’t have the time to do that myself, but hopefully someone does!</p>

<p>As always, feel free to reach out with any questions, suggestions, or if you happen to implement this technique in a real fuzzer :)</p>]]></content><author><name>Nick Gregory</name></author><category term="security" /><summary type="html"><![CDATA[About the same time I wrote up my previous post about snapshot fuzzing, I was thinking about other ways to restore program state for fuzzing, ideally in userland for ease of use.]]></summary></entry><entry><title type="html">diffusion.gallery - A Constantly Changing Machine Generated Art Gallery</title><link href="https://www.nickgregory.me/post/2022/09/10/diffusion-gallery/" rel="alternate" type="text/html" title="diffusion.gallery - A Constantly Changing Machine Generated Art Gallery" /><published>2022-09-10T00:00:00+00:00</published><updated>2022-09-10T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2022/09/10/diffusion-gallery</id><content type="html" xml:base="https://www.nickgregory.me/post/2022/09/10/diffusion-gallery/"><![CDATA[<p>tl;dr: <a href="https://www.diffusion.gallery/">diffusion.gallery</a> is a website I put together which feeds random prompts from OpenAI into <a href="https://stability.ai/blog/stable-diffusion-public-release">Stable Diffusion</a>. It’s pretty neat.</p>

<h1 id="stable-diffusion">Stable Diffusion</h1>
<p>The past few weeks have been an exciting time for ML/AI (at least to an outside observer like myself).
There’s been a staggering number of innovations and experiments around <a href="https://stability.ai/blog/stable-diffusion-public-release">Stable Diffusion</a>, a new model which can synthesize images from text (similar to <a href="https://openai.com/blog/dall-e/">DALL-E</a>) or even from other source images.
You can use it to <a href="https://thishousedoesnotexist.org/">generate houses</a>, <a href="https://www.reddit.com/r/StableDiffusion/comments/wsnlh4/how_to_draw_an_owl/">“draw the rest of the owl”</a>, or create some <a href="https://www.reddit.com/r/StableDiffusion/comments/x3u3r0/jeflon_zuckergates/">really cursed images</a> just to give a few examples.</p>

<h1 id="trying-it-out">Trying It Out</h1>
<p>I saw on Twitter that someone had <a href="https://github.com/bfirsh/stable-diffusion/tree/apple-silicon-mps-support">added support for Apple Silicon</a> to the image generation scripts, and since I daily-drive an M1 MacBook Pro, that made it easy enough for me to try out.
One of the first things I asked it to generate was some concept art for a place called the <a href="https://www.destinypedia.com/Dreaming_City#Gallery">“Dreaming City”</a> from one of my favorite games, Destiny 2.
The results were really impressive, especially considering that I hadn’t done any prompt or parameter tuning:</p>

<p><img src="/images/stable-diffusion/grid-0021.png" alt="Dreaming City, Destiny 2" />
<img src="/images/stable-diffusion/grid-0017.png" alt="Dreaming City, Destiny 2" />
<img src="/images/stable-diffusion/grid-0019.png" alt="Dreaming City, Destiny 2" /></p>

<h1 id="the-idea">The Idea</h1>
<p>After messing around a bit more and seeing some of the things Stable Diffusion was able to produce, I thought it would be neat to have it constantly generating new art pieces and have them shown in a framed display on my wall, resembling art in a gallery.
The first idea I had was for it to randomly pick from a handful of subjects, environments, and styles and ask the model to generate that. However, after mentioning it to a <a href="https://twitter.com/KinglyCrow2">friend</a>, he suggested I go “full AI” and have an OpenAI model generate the prompt, which Stable Diffusion then turns into an image.</p>

<p>After a few hours of tinkering, the result is <a href="https://www.diffusion.gallery">diffusion.gallery</a>.</p>

<h1 id="diffusiongallery">diffusion.gallery</h1>

<p>Every 5 minutes a new prompt and image is generated and uploaded to the gallery. The page will automatically refresh so you can leave it up all day if you want - every time you switch to it, odds are there will be a brand new piece.</p>

<p>The bottom right shows a description card for the piece including its “author” (the model that generated the image), its “title” (the timestamp at which it was generated), and the prompt passed to Stable Diffusion which generated the image.</p>

<p>N.B. The images are created at a 16:9 aspect ratio (1024x576) for ideal viewing on normal widescreen monitors.</p>

<blockquote>
  <p>For those wondering: the odd resolution is due to the fact that the dimensions must be divisible by 64, and this is the only 16:9 resolution (below 1080p) for which this is true.</p>
</blockquote>

<h3 id="disclaimer">Disclaimer</h3>

<p>To try to avoid generating anything NSFW, the prompt to OpenAI explicitly requests that the resulting prompt (from which the image is generated) not focus on any specific people.
Combined with the safety classifier built into Stable Diffusion, I don’t <em>think</em> anything generated will be offensive, but it’s obviously still possible that something bad comes out.
Use the site at your own risk.</p>

<h1 id="some-pieces">Some Pieces</h1>
<p>After letting it run for just a few hours, I saw some really interesting images go by, ranging from:</p>

<p>Hyper-realistic pictures</p>

<p><a href="https://www.diffusion.gallery/#1662508500"><img src="/images/stable-diffusion/1662508500.png" alt="A mostly barren landscape with a few jagged peaks in the distance. The sky is a harsh, unforgiving blue, and the air is cold and dry." /></a>
<em>Prompt: A mostly barren landscape with a few jagged peaks in the distance. The sky is a harsh, unforgiving blue, and the air is cold and dry.</em></p>

<p><br /></p>

<p>To dystopic drawings</p>

<p><a href="https://www.diffusion.gallery/#1662526500"><img src="/images/stable-diffusion/1662526500.png" alt="This painting is of an abandoned building in the middle of a dark desert. The subject is an old, crumbling building surrounded by nothing but empty, scorched earth. The painting is dark and moody, with a erie, atmospheric feel to it." /></a>
<em>Prompt: This painting is of an abandoned building in the middle of a dark desert. The subject is an old, crumbling building surrounded by nothing but empty, scorched earth. The painting is dark and moody, with a erie, atmospheric feel to it.</em></p>

<p><br /></p>

<p>To impressionist paintings</p>

<p><a href="https://www.diffusion.gallery/#1662519300"><img src="/images/stable-diffusion/1662519300.png" alt="This painting depicts a bustling city street full of people and their belongings. The scene is brightly lit and colorful, and the buildings in the background are sharply silhouetted against the sky. The painting is undoubtedly impressionistic, with a loose, free style that allows the various elements to share the spotlight." /></a>
<em>Prompt: This painting depicts a bustling city street full of people and their belongings. The scene is brightly lit and colorful, and the buildings in the background are sharply silhouetted against the sky. The painting is undoubtedly impressionistic, with a loose, free style that allows the various elements to share the spotlight.</em></p>

<p><br /></p>

<p>To the abstract</p>

<p><a href="https://www.diffusion.gallery/#1662520800"><img src="/images/stable-diffusion/1662520800.png" alt="This painting is of an abstract landscape with bright blues, greens, and oranges. It has a feeling of dynamism and energy, as if it is constantly moving. The style is impressionistic, with soft brush strokes and strong highlights." /></a>
<em>Prompt: This painting is of an abstract landscape with bright blues, greens, and oranges. It has a feeling of dynamism and energy, as if it is constantly moving. The style is impressionistic, with soft brush strokes and strong highlights.</em></p>

<p><br /></p>

<p>And finally, the best example of AI: a prompt confidently describing a non-existent painting with a full background story and artist</p>

<p><a href="https://www.diffusion.gallery/#1662507600"><img src="/images/stable-diffusion/1662507600.png" alt="Salmon Run is an iconic painting by American painter Robert Henri. It depicts a panoramic view of the Columbia River, with salmon leaping upriver to spawn. The painting was commissioned by the Oregon Railroad and Navigation Company in 1914 as a celebration of the railroad's 50th anniversary." /></a>
<em>Prompt: Salmon Run is an iconic painting by American painter Robert Henri. It depicts a panoramic view of the Columbia River, with salmon leaping upriver to spawn. The painting was commissioned by the Oregon Railroad and Navigation Company in 1914 as a celebration of the railroad’s 50th anniversary.</em></p>

<h1 id="framing-it">Framing It</h1>

<p>To close out the project, I picked up a <a href="https://www.amazon.com/dp/B09L12DGW5">thin 15.6” OLED display</a> and a cheap 11x17 frame. Some taping and wire management later, the gallery was now up on the wall.</p>

<p><img src="/images/stable-diffusion/framed.jpg" alt="Framed display showing diffusion.gallery" /></p>]]></content><author><name>Nick Gregory</name></author><category term="ai" /><summary type="html"><![CDATA[tl;dr: diffusion.gallery is a website I put together which feeds random prompts from OpenAI into Stable Diffusion. It’s pretty neat.]]></summary></entry><entry><title type="html">Using Graphs to Search for Code</title><link href="https://www.nickgregory.me/post/2022/07/02/go-code-as-a-graph/" rel="alternate" type="text/html" title="Using Graphs to Search for Code" /><published>2022-07-02T00:00:00+00:00</published><updated>2022-07-02T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2022/07/02/go-code-as-a-graph</id><content type="html" xml:base="https://www.nickgregory.me/post/2022/07/02/go-code-as-a-graph/"><![CDATA[<p>Some time ago, I was working on a server to generate images from weather RADAR data (a separate post on this will come at some point).
As part of this, I spent a few hours profiling my code and found a tiny “bug” in the open source library I was using to parse one type of RADAR data.</p>

<p>To summarize, the library was doing</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data := make([]uint8, ldm)
binary.Read(r, binary.BigEndian, &amp;data)
</code></pre></div></div>

<p>when an equivalent but much faster way of doing this is</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data := make([]uint8, ldm)
binary.Read(r, binary.BigEndian, data)
</code></pre></div></div>

<p>Notice the difference?</p>

<p>Passing <code class="highlighter-rouge">&amp;data</code> instead of just <code class="highlighter-rouge">data</code> caused <code class="highlighter-rouge">binary.Read</code> to take <em>nearly twice as long</em>, and the function this was in was responsible for the vast majority of the request runtime. That one character decreased throughput by nearly 40%!</p>

<blockquote>
  <p>Aside: you may be wondering, why is passing a pointer to the array so much slower?
It’s because Go’s binary.Read has a <a href="https://cs.opensource.google/go/go/+/refs/tags/go1.18.3:src/encoding/binary/binary.go;l=192">fast-path for array types</a>, but a pointer to an array ends up taking a much slower <code class="highlighter-rouge">reflect</code> based path.</p>
</blockquote>

<p>Finding this got me thinking: this seems like something that could very easily slip into other code. Is there any existing way to <em>quickly</em> search over code when you find a bug or “anti-pattern” like this?</p>

<h2 id="existing-tools">Existing Tools</h2>

<p>There are three broad categories that existing tools seem to fall into:</p>

<ol>
  <li>Highly advanced program analysis toolkits which build per-project databases (potentially compiling them in the process):
    <ul>
      <li><a href="https://codeql.github.com/">CodeQL</a></li>
      <li><a href="https://joern.io/">Joern</a></li>
    </ul>
  </li>
  <li>AST-based matching tools
    <ul>
      <li><a href="https://semgrep.dev/">Semgrep</a></li>
      <li><a href="https://github.com/googleprojectzero/weggli">Weggli</a></li>
    </ul>
  </li>
  <li>“Simple” grep-like tools
    <ul>
      <li><a href="https://sourcegraph.com/">Sourcegraph</a></li>
      <li><code class="highlighter-rouge">grep</code></li>
    </ul>
  </li>
</ol>

<p>None of these address everything that’s needed to do what I want though.
The program analysis toolkits by their nature don’t allow for easy querying for “all uses of symbol X across all repos” without consuming a <em>ton</em> of compute resources running the query on each project independently.
Similarly, the simple AST tools <em>may</em> have enough information to run the query (if they do basic type deduction), but as far as I know, are all built as “interactive” tools that once again only run on one repo at a time and don’t index anything.
Lastly, the purely textual tools can have indexes built, but they don’t have the type information required to check for the “argument is a pointer to an array” part of our query.</p>

<p>I wanted something that has the intelligence of the first group (type information, data flow, import resolution, etc.) with the scalability of the third (tens of thousands of projects queryable in seconds).</p>

<p>So I wrote it! Introducing <a href="https://github.com/kallsyms/go-graph">go-graph</a>, because I’m bad at naming things.</p>

<h1 id="goals">Goals</h1>
<h2 id="overview">Overview</h2>

<p>For now, this project is only concerned with Go code. Why?</p>

<ul>
  <li>Go has a large open source community, so there’s plenty of code to search</li>
  <li>Go is a simple language (for good or for bad), which makes our AST parsing easy
    <ul>
      <li>Go also ships with all of the libraries we need to parse and analyze Go code because the compiler is self-hosted</li>
    </ul>
  </li>
  <li>Go’s lack of a preprocessor means that a specific file in a package can only ever be compiled one way - there’s no chance for an <code class="highlighter-rouge">#ifdef</code> or similar to change the file</li>
  <li>Lastly, but perhaps most importantly: the motivating bug was in Go code and I wanted to keep the scope of this project relatively tight since there’s basically no upper bound on how complex it could get</li>
</ul>

<p>Now with that established, let’s talk about how it works.</p>

<h2 id="implementation">Implementation</h2>

<p>First, the schema. go-graph indexes:</p>

<ul>
  <li>Metadata (source URL, and version) about each indexed Go package</li>
  <li>The functions in each package</li>
  <li>The statements in each function, their raw source text, and their successor(s) (i.e. the control flow graph)</li>
  <li>Any function calls which happened in a statement, with the resolved target(s) of the call</li>
  <li>Variables (name and type), as well as relations from each variable to where it’s defined and to all statements in which it’s referenced</li>
</ul>

<p>This is more than enough to be able to run the original motivating query, and even allows for some more complex queries like searching for <a href="https://en.wikipedia.org/wiki/Program_slicing">program slices</a> that match some criteria (as we’ll see later).</p>
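<p>As a rough illustration, the vertex documents above could be mirrored in Go like so. The field names match those used in the AQL queries later in the post, but this is a sketch for intuition, not go-graph's actual schema definitions:</p>

```go
package main

import "fmt"

// Illustrative mirror of go-graph's vertex documents (hypothetical names).
type Package struct {
	SourceURL string // e.g. "encoding/binary" or a repo URL
	Version   string
}

type Function struct {
	Name string
}

type Statement struct {
	File string
	Text string // raw source text of the statement
}

type Variable struct {
	Name string
	Type string // e.g. "[]uint8"
}

// Edges (Functions, Callee, CallSiteStatement, References, Assigns, ...)
// connect these documents: package -> function -> call site -> statement
// -> variable.

func main() {
	v := Variable{Name: "data", Type: "[]uint8"}
	fmt.Println(v.Name, v.Type)
}
```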

<h3 id="storage">Storage</h3>

<p>Next, I had to decide how to store all of this data. A graph database seemed natural given the, well, graph structure of all of this information.</p>

<blockquote>
  <p>You could say I’m building a source-graph, but that name was already taken :)</p>
</blockquote>

<p>Additionally, graph query languages are specifically built to let us easily write “multi-hop” (or even fully recursive) queries, which is a necessity when moving between functions, their call sites, the statements for those call sites, the variables referenced in those statements, etc.
This <em>could</em> all be done in a relational DB, but a graph DB is a much more natural starting point.</p>

<p>So which graph database to use?</p>

<p>I had been wanting to try out JanusGraph for a while, so I did an initial implementation using it, mainly out of curiosity. It did work; however, indexing was somewhat bottlenecked (topping out at ~80k vertices or edges per second, and dipping as low as 20k/s), and the Go libraries for working with Gremlin queries (the query language JanusGraph uses) are not amazing.</p>

<p>As part of a series of optimizations I did after initial development, I was looking at Neo4j (as it’s more or less the industry standard as far as I know), but ran across a <a href="https://community.neo4j.com/t5/neo4j-graph-platform/tuning-for-larger-than-memory-multiple-tb-graph-node-insertion/td-p/30368">forum post</a> by <a href="https://unhexium.net/research/neo4j-performance-adventures-for-petabyte-scale-datasets/">Ben Klein</a>, who was having issues doing large-scale bulk inserts.</p>

<p>I mention this in particular because it just so happens the project that forum post talks about using Neo4j for is <a href="https://github.com/utk-se/WorldSyntaxTree">WorldSyntaxTree</a>. The goal of that project is, in short, representing repositories, files, and parsed <a href="https://github.com/tree-sitter/tree-sitter">tree-sitter</a> ASTs in a graph database - similar enough to what I’m doing that I figured if they were having issues with Neo4j, I probably would as well. Luckily for me, they had already tested and decided on another database, <a href="https://www.arangodb.com/">ArangoDB</a>, so I followed in their footsteps.</p>

<h1 id="results">Results</h1>
<h2 id="indexing">Indexing</h2>
<p>After hacking on this on and off for a few weeks then coming back a few months later to do the aforementioned performance optimizations, I had a working version which did everything I needed.</p>

<p>I shallow cloned all Go repos on GitHub with &gt;=100 stars (11,659 of them) giving me about 330GB of source, and started indexing.
Before the optimization work, this took about 1.5 weeks to ingest, but as of now, all 11k repos can be indexed in about 11 hours, resulting in an ArangoDB data directory of just under 200GB. For reference, the Go indexing code was running on an 8 core Xeon with 32GB of RAM, and the database was running in a VM on a machine with a 10 core i9, also with 32GB of RAM.</p>

<p>Some fun statistics from the ingest process:</p>
<ul>
  <li>Network traffic to the database averaged about 50Mbps, peaking at nearly 200Mbps</li>
  <li>After all was done, the database had:
    <ul>
      <li>219k distinct packages, 420k unique (package,version) pairs</li>
      <li>27M functions</li>
      <li>450M function calls</li>
      <li>190M statements</li>
      <li>126M variables</li>
      <li>91M variable assigns</li>
      <li>360M variable references</li>
      <li>185M statement “next” edges</li>
    </ul>
  </li>
</ul>

<h2 id="other-tools">Other Tools</h2>
<p>I didn’t bother spending the time building CodeQL or Joern databases for each project as I’m almost certain that would have taken longer than my indexing did and I know the query times definitely wouldn’t satisfy the seconds-to-minutes requirement. Just starting/initializing either system takes multiple seconds, times tens of thousands of repos puts any queries well into the tens of minutes to hours range.</p>

<p>With those out, the only other tool that had all of the information needed to execute the original query was Semgrep.
I let it rip over all of the code with this rule:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rules:
- id: untitled_rule
  pattern: |
      binary.Read(..., &amp;($X : []$Y))
  message: Semgrep found a match
  languages: [go]
  severity: WARNING
</code></pre></div></div>

<p>aaaandd it died:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time semgrep --metrics=off -c /tmp/semgrep_rule.yaml --verbose
...
====[ BEGIN error trace ]====
Raised at Stdlib__map.Make.find in file "map.ml", line 137, characters 10-25
Called from Sexplib0__Sexp_conv.Exn_converter.find_auto in file "src/sexp_conv.ml", line 156, characters 10-37
=====[ END error trace ]=====
...
3374.64s user 505.06s system 143% cpu 44:59.79 total
</code></pre></div></div>

<p>Running semgrep for each repo individually <em>did</em> work however, and it yielded a run time of about 2h15m:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time for d in *; do semgrep --metrics=off -c /tmp/semgrep_rule.yaml $d; done
...
8210.36s user 1246.15s system 115% cpu 2:16:59.63 total
</code></pre></div></div>

<p>It looks like some of semgrep’s initialization is single-threaded, so giving it the benefit of the doubt, we could expect ~13 minute query times if I had run 10 instances of this in parallel.</p>

<h2 id="go-graph">go-graph</h2>

<p>I used the following Arango query to perform the search:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FOR p IN package
FILTER p.SourceURL == "encoding/binary"
FOR f IN OUTBOUND p Functions
FILTER f.Name == "Read"
FOR callsite IN INBOUND f Callee
FOR statement IN OUTBOUND callsite CallSiteStatement
FOR var IN OUTBOUND statement References
FILTER STARTS_WITH(var.Type, "[]")
FILTER CONTAINS(statement.Text, CONCAT("&amp;", var.Name))
FOR callfunc in INBOUND statement Statement
FOR callpkg in INBOUND callfunc Functions
RETURN {package: callpkg.SourceURL, file: statement.File, text: statement.Text, var: var.Name, type: var.Type}
</code></pre></div></div>

<p>Writing this out in English:</p>
<ol>
  <li>Take the <code class="highlighter-rouge">encoding/binary</code> package</li>
  <li>Traverse out to Functions named <code class="highlighter-rouge">Read</code></li>
  <li>Traverse to the call sites of <code class="highlighter-rouge">Read</code>, and out to the statement in which the call occurred</li>
  <li>Traverse out to the variable referenced within that statement</li>
  <li>Filter for variables whose type starts with <code class="highlighter-rouge">[]</code> and which appear after a <code class="highlighter-rouge">&amp;</code> in the raw statement text</li>
  <li>Traverse out from the call statement to the containing function, and then to the package containing said function</li>
  <li>Return the package, file, and statement in which the call happened, and the variable name and type which was passed incorrectly</li>
</ol>

<p>And the performance?</p>

<p>…drumroll…</p>

<p>This query ran in only ~20s! Exactly in the timeframe I was looking for. Not quite “interactive” but also quick enough that you can iterate on queries without losing your train of thought.</p>

<h1 id="another-one">Another one</h1>

<p>One other potential use case that has interested me is using go-graph to do “poor man’s data flow analysis”, mainly to find examples of how you can go from one function/data type to another.</p>

<p>This is a somewhat contrived example, but let’s find all uses of <code class="highlighter-rouge">crypto/rsa.GenerateKey</code> where the result flows through 0 or more intermediary variables to be used in a <code class="highlighter-rouge">pem.Encode</code> call:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// find calls to crypto/rsa.GenerateKey
FOR p IN package
FILTER p.SourceURL == "crypto/rsa"
FOR f IN OUTBOUND p Functions
FILTER f.Name == "GenerateKey"
FOR call IN INBOUND f Callee
FOR srccallstmt IN OUTBOUND call CallSiteStatement

// Walk assign-&gt;ref-&gt;assign-&gt;ref-&gt;...
// until we reach a statement with an interesting call.
// v "alternates" between being a variable and being a statement

FOR v, e, path IN 1..5 OUTBOUND srccallstmt Assigns, INBOUND References
    PRUNE CONTAINS(v.Text, "Encode")
    OPTIONS {uniqueVertices: "path"}

// ensure that the end vertex is where we want
// quick check before doing any traversals
FILTER CONTAINS(v.Text, "Encode")
// now walk to the call site, called func,
// and ensure it's actually encoding/pem.Encode
FOR dstcallstmt IN INBOUND v CallSiteStatement
FOR dstcallfunc IN OUTBOUND dstcallstmt Callee
FILTER dstcallfunc.Name == "Encode"
FOR dstcallpkg IN INBOUND dstcallfunc Functions
FILTER dstcallpkg.SourceURL == "encoding/pem"

// ensure the "reference" is not actually an assignment
// go-graph considers a variable to be referenced
// even if it's on the left-hand side of an assignment
// which means `x, err := GenerateKey(); y, err := bar; Encode(y)`
// would match without this last filter since `err` is assigned
// in the first statement then also considered as "referenced"
// in the second
FILTER LENGTH(
    FOR stmt IN path.vertices
    FILTER IS_SAME_COLLECTION(statement, stmt)
    FOR checkassign IN OUTBOUND stmt Assigns
    FOR target IN path.vertices
    FILTER IS_SAME_COLLECTION(variable, target)
    FILTER POSITION(path.vertices, target, true) &lt; POSITION(path.vertices, stmt, true)
    FILTER target == checkassign
    RETURN checkassign
) == 0
RETURN path
</code></pre></div></div>

<p>This query might be more complex than you’d expect, but despite that it works in a relatively timely fashion. 1000 results takes 2 seconds for up to 1 intermediate variable, 8 seconds for up to 2 intermediates, and 30 seconds for up to 3. For example, it returned <a href="https://github.com/go-gitea/gitea/blob/main/cmd/cert.go#L102">this code in Gitea</a> (where I originally got the idea to test from <code class="highlighter-rouge">crypto/rsa.GenerateKey</code> to <code class="highlighter-rouge">pem.Encode</code>):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>priv, err = rsa.GenerateKey(rand.Reader, c.Int("rsa-bits"))
...
derBytes, err := x509.CreateCertificate(rand.Reader, &amp;template, &amp;template, publicKey(priv), priv)
...
err = pem.Encode(certOut, &amp;pem.Block{Type: "CERTIFICATE", Bytes: derBytes})
</code></pre></div></div>

<p>But it also found a longer chain in a <a href="https://github.com/kubernetes/client-go/blob/master/util/cert/cert.go#L115">kubernetes cert utility</a>:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>caKey, err := rsa.GenerateKey(cryptorand.Reader, 2048)
...
caDERBytes, err := x509.CreateCertificate(cryptorand.Reader, &amp;caTemplate, &amp;caTemplate, &amp;caKey.PublicKey, caKey)
...
caCertificate, err := x509.ParseCertificate(caDERBytes)
...
derBytes, err := x509.CreateCertificate(cryptorand.Reader, &amp;template, caCertificate, &amp;priv.PublicKey, caKey)
...
err := pem.Encode(&amp;certBuffer, &amp;pem.Block{Type: CertificateBlockType, Bytes: derBytes})
</code></pre></div></div>

<p>This ability could help you when you have a data source (e.g. <code class="highlighter-rouge">GenerateKey</code>) and know where the data needs to end up (e.g. written to a PEM file), but can’t find any examples of what function(s) are needed to convert between (<code class="highlighter-rouge">x509.CreateCertificate</code> in this case).</p>

<h1 id="improvements">Improvements</h1>

<p>go-graph currently doesn’t keep a graph of the full AST, so there’s no “pure graph” way to find when a reference to a variable is taken. It wouldn’t be too hard to add, but since I knew the trick of looking for <code class="highlighter-rouge">&amp;{variableName}</code> would work, I didn’t bother implementing it for now. Similarly, the position/context of a variable reference is not saved anywhere, so (barring more string comparisons) you’re unable to specify something like “&amp;x is the 3rd argument”.</p>

<p>There are also “semantically equivalent” variants of the first query which aren’t captured. For instance:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data := make([]uint8, ldm)
data2 := &amp;data
binary.Read(r, binary.BigEndian, data2)
</code></pre></div></div>

<p>For the original motivating query though, it doesn’t really matter, as the mistake is going to be adding a <code class="highlighter-rouge">&amp;</code> into the <code class="highlighter-rouge">binary.Read</code> call, not passing a pre-existing variable of type <code class="highlighter-rouge">*[]whatever</code>. It’s also entirely possible to change the query to look for that case of course if that is desired.</p>

<h1 id="closing-words">Closing Words</h1>

<p>This level of search definitely isn’t for every use case, but it does fit nicely into a slot that I haven’t seen any other project fill.
There’s a <em>lot</em> of details in the implementation not covered here, but they’re not really relevant to the overarching goal.
I encourage you to go <a href="https://github.com/kallsyms/go-graph">check out the code</a> if you’re interested in the nitty gritty.
I’m also happy to provide tarballs of the cloned repos and/or indexed database if anyone would like them to experiment on without having to clone and index everything themselves.</p>

<p>I hope you enjoyed reading! As always, feel free to drop me an email with any questions, suggestions, or other ideas.</p>]]></content><author><name>Nick Gregory</name></author><category term="programming" /><category term="devtools" /><summary type="html"><![CDATA[Some time ago, I was working on a server to generate images from weather RADAR data (a separate post on this will come at some point). As part of this, I spent a few hours profiling my code and found a tiny “bug” in the open source library I was using to parse one type of RADAR data.]]></summary></entry><entry><title type="html">Seeing the Clouds with the Cloud</title><link href="https://www.nickgregory.me/post/2022/06/03/azure-orbital/" rel="alternate" type="text/html" title="Seeing the Clouds with the Cloud" /><published>2022-06-03T00:00:00+00:00</published><updated>2022-06-03T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2022/06/03/azure-orbital</id><content type="html" xml:base="https://www.nickgregory.me/post/2022/06/03/azure-orbital/"><![CDATA[<p>If you follow AWS closely, you may have heard about a niche product launch a few years back called <a href="https://aws.amazon.com/ground-station/">Ground Station</a> which lets you rent, well, a ground station (basically a big antenna plus supporting equipment to communicate with satellites).
A friend recently linked me an <a href="https://aws.amazon.com/blogs/publicsector/earth-observation-using-aws-ground-station/">AWS blog post</a> with a sample use case which described using it as a way of receiving real time imagery from orbiting weather satellites.
Now funny enough, receiving data from polar orbiting weather satellites has been a side project of mine for over a decade now, but living in NYC has put a bit of a hold on it. I used to have a <a href="https://www.instructables.com/NOAA-Satellite-Signals-with-a-PVC-QFH-Antenna-and-/">home-built QFH antenna</a> which I used to receive images with a surprisingly high success rate given the janky construction of it.</p>

<p><img src="/images/azure-orbital/qfh_antenna.jpg" alt="The antenna" style="width: 40%" /></p>

<p>Yes, you’re seeing that correctly - it’s an antenna made of PVC tubing and coax duct-taped to the top of a pole for a basketball hoop. Crude but effective.</p>

<p>Anyways, the ability to use a remote antenna to downlink imagery piqued my interest, especially since these antennas would let me get the highest quality digital imagery sent out in the 8GHz <a href="https://en.wikipedia.org/wiki/X_band">X-band</a> instead of the lower-quality analog <a href="https://www.sigidwiki.com/wiki/Automatic_Picture_Transmission_(APT)">APT</a> transmissions around 137MHz that I had received in the past. So I set out to try and downlink a “true color” image.</p>

<p>I requested access to AWS ground station, but also found out about and filled out a request form to get access to <a href="https://azure.microsoft.com/en-us/services/orbital/">Azure Orbital</a> - Microsoft’s competing offering which is still in preview.</p>

<h1 id="onboarding">Onboarding</h1>

<p>I never ended up hearing back from AWS after an initial email from them requesting details about my use case, however this is probably for the best as it costs $10/min to rent one of their antennas. With one pass of a polar orbiting satellite lasting anywhere from ~8-15 minutes, this would have gotten <em>really</em> expensive to be playing around on.</p>

<p>Since Azure Orbital was still in preview though, it was free to use! The Orbital team onboarded me to the preview quickly; however, after a bit of back and forth trying to figure out why I was getting an error when a “contact” was supposed to start, I found out that they were only allowing downlinking from NASA’s Aqua satellite, not the weather-specific polar orbiting satellites (e.g. NOAA-20).
This was fine though, as at the end of the day this was just an experiment and I had no need for the weather satellites in particular.</p>

<h1 id="trial-and-error">Trial and Error</h1>

<p>While Azure was great about getting me on the platform, their docs were… lacking to say the least.
It looks like they’ve added a small <a href="https://docs.microsoft.com/en-us/azure/orbital/howto-downlink-aqua">how-to guide</a> in the months since I was experimenting which explains some of the questions I had, however it still doesn’t cover the last phase of actually demodulating and decoding the signal into usable data.
It’s understandable since that’s not “relevant” to the service, but what good is it to receive data without doing something with it!</p>

<p>In case it helps anyone else though, I’ve put the questions I had and the answers the Orbital team gave back to me <a href="#appendix-orbital-qa">down below</a>.</p>

<h1 id="data-ingest">Data Ingest</h1>

<h2 id="preface">Preface</h2>
<p>Before diving in to details, I want to quickly go over the process for transforming radio signals into data, at least as it applies to receiving data from AQUA using Azure Orbital.</p>

<ol>
  <li>RF data is received by an antenna, digitized, and transmitted over the network as a series of <a href="https://www.tek.com/en/blog/quadrature-iq-signals-explained"><strong>I/Q data</strong></a> encapsulated in “VITA-49” packets. These packets include the raw data as well as a bunch of other metadata from the receiving system (things like timestamps, receiver gain(s), configured intermediate frequency, etc.).</li>
  <li>The I/Q data is <strong>demodulated</strong>, in our case as a <a href="https://www.allaboutcircuits.com/technical-articles/quadrature-phase-shift-keying-qpsk-modulation/">QPSK</a> signal. This transforms the RF stream into one of four possible <em>symbols</em>.</li>
  <li>The demodulated <em>symbols</em> are then <strong>decoded</strong> into complete data <em>frames</em>. This is where the data is first interpreted (synchronizing to the start of frames).</li>
  <li>The <em>frames</em> are then checked for errors, the headers are parsed, and the frames are separated by the “virtual channel” (so data from multiple instruments can all share a common downlink and even be interleaved) then dispatched for processing.</li>
</ol>

<p>There’s a lot to unpack in that if you’re new to this, but hopefully it helps make the rest of the blog at least a bit more understandable!</p>
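<p>As a toy illustration of step 2, a hard-decision QPSK “slicer” maps each I/Q sample to one of four symbols based purely on the signs of its components. To be clear, this is just a sketch to build intuition — the real downlink’s symbol mapping, and the carrier/clock recovery that has to run before a decision like this is even possible, are considerably more involved:</p>

```go
package main

import "fmt"

// sliceQPSK makes a hard symbol decision for one I/Q sample: the signs of
// the in-phase (real) and quadrature (imaginary) components pick one of the
// four constellation quadrants. The 2-bit mapping here is illustrative, not
// necessarily the one the actual downlink uses.
func sliceQPSK(iq complex128) byte {
	var sym byte
	if real(iq) < 0 {
		sym |= 0b10
	}
	if imag(iq) < 0 {
		sym |= 0b01
	}
	return sym
}

func main() {
	// One clean sample per quadrant.
	for _, s := range []complex128{1 + 1i, -1 + 1i, -1 - 1i, 1 - 1i} {
		fmt.Printf("%v -> %02b\n", s, sliceQPSK(s))
	}
}
```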

<h2 id="receiving">Receiving</h2>
<p>As the Q&amp;A at the end mentions, Orbital <em>can</em> do the demodulation and/or decoding for us; however, the format for specifying how to do that is proprietary to a specific brand of modem (Kratos).
Googling around a bit didn’t show any public documentation and I didn’t really feel like contacting them to try and get access to docs, so we’ll have to do this ourselves in software from the raw RF data.</p>

<p>First off, I set up <code class="highlighter-rouge">socat</code> on a small Azure VM to receive the data from the Orbital service and dump it to a file. I could have set up the listener on one of my personal machines, however given the bandwidth required to receive (~300Mbps) and the fact that I was on the opposite coast of the U.S., I opted to use a VM local to the Azure region the receiver was in to stage the data first.</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ socat TCP-LISTEN:1234,reuseaddr,fork 'SYSTEM:cat &gt; raw-$(date +%s).dat'
</code></pre></div></div>

<p>After a pass, I manually uploaded the <code class="highlighter-rouge">.dat</code> files to object storage so that I could retrieve them later for processing on my machines at home.</p>

<p>Next, I needed to extract the raw data from the VITA-49 packets that Orbital actually sends. To do this, I wrote a Python script which parses the header of each packet (since the headers are variable length) and dumps the raw I/Q payload into a file since that’s all that matters for our purposes.
This took a bit longer than you might think because the actual specification for VITA-49 <a href="https://www.vita.com/Sys/Store/Products/258942">costs a hundred dollars</a>, and I hadn’t yet been pointed to the VITA-49 compatible (but free!) <a href="https://dificonsortium.org/">“Digital IF Interoperability Standard”</a>.</p>

<blockquote>
  <p>Editors note: while trying to find the price for the VITA-49 spec again I ran across a <a href="https://www.wirelessinnovation.org/assets/documents/sdrf-07-i-0006-v0-0-0%20vita%2049%20spec.pdf">draft of the spec</a> which seems to cover everything. I guess my Google-fu was off…</p>
</blockquote>

<p>The data extraction script can be found <a href="https://gist.github.com/kallsyms/c6e29bb72c190cd4b0edc5c1511bd3f9">here</a>.</p>

<h2 id="demodulation">Demodulation</h2>

<p>I started with a <a href="https://github.com/altillimity/X-Band-Decoders/blob/master/Flowcharts/AQUA%20Demodulator.grc">GNU Radio flowchart</a> from the <a href="https://github.com/altillimity/X-Band-Decoders">altillimity/X-Band-Decoders</a> GitHub repo to demodulate the signal.</p>

<h3 id="aside-gnu-radio">Aside: GNU Radio</h3>
<p>GNU Radio is an open source software-defined radio toolkit. It provides a ton of building <em>blocks</em> which can be chained together into <em>flowcharts</em> which implement any type of signal processing workload.
For example, the flowchart used to do the demodulation looks like this:</p>

<p><img src="/images/azure-orbital/flowchart.png" alt="GNU Radio flowchart" /></p>

<p>Don’t get me wrong, this looks very intimidating at first. I definitely still don’t understand all of it! However it’s worlds better than trying to piece together the code to do all of this signal processing yourself.</p>

<p> </p>

<p>This was probably the most finicky part of the entire process. When running any GNU Radio chart on my M1 Macbook Pro (GRC version 3.9.3.0), the graphs didn’t update at all, and it seemed like the entire thing froze on startup.
After spending way too long thinking that was a bug in the chart and trying everything I could think of to make it work, I eventually ran it on my x86 Linux laptop and the graphs were updating and it seemed to be doing <em>something</em>.</p>

<p>A few minutes into processing (once the satellite was overhead), the frequency plot looked good:</p>

<p><img src="/images/azure-orbital/spectrum.png" alt="Frequency spectrum (FFT) visualization" /></p>

<p>but the constellation plot had a weird “double image” and was showing eight clusters instead of the expected four (since this is QPSK). The demodulated data that was coming out was also not parsable by any tool I found - it seemed to be complete garbage.</p>

<p><img src="/images/azure-orbital/bad_constellation.png" alt="Bad constellation plot" /></p>

<p>Suspecting that this had something to do with clock recovery (matching the exact rate at which symbols are sampled to the rate the satellite is sending them out), some Googling around turned up <a href="https://www.tablix.org/~avian/blog/archives/2015/03/notes_on_m_m_clock_recovery/">a blog post</a> describing what the “Clock Recovery MM” block was actually doing under the hood. Applying its suggestions and tweaking the block parameters, I got slightly better output; however, it still wasn’t great. The decode tools were getting sync, but nearly every frame was corrupted.
Finally, I saw on the <a href="https://wiki.gnuradio.org/index.php?title=Clock_Recovery_MM">“Clock Recovery MM” GNU Radio wiki page</a> that the block was actually deprecated in favor of a new “Symbol Sync” block.
I swapped that in and tried a few different algorithms, eventually settling on zero crossing, which produced a great-looking constellation and got the decode tools to start emitting uncorrupted frames.</p>

<p><img src="/images/azure-orbital/good_constellation.png" alt="Good constellation plot" /></p>

<p>The final flowchart is available <a href="/files/AQUA_Demodulator.grc">here</a>.</p>

<h2 id="decoding">Decoding</h2>
<p>Per the original AWS blog post, NASA’s <a href="https://directreadout.sci.gsfc.nasa.gov/?id=dspContent&amp;cid=69">RT-STPS</a> toolkit is the “official” way to decode data from AQUA (and other) satellites. Unfortunately, despite it saying it got lock on the demodulated data, every frame it processed was “unroutable.”
I dug into the source and eventually set a watchpoint where the satellite ID is extracted from the frame headers (the satellite ID being how it decides to route data), and the ID was all wrong.
I’m still unsure why this was, but I didn’t want to spend much more time on it as the tooling in the aforementioned X-Band-Decoders repo already had <a href="https://github.com/altillimity/X-Band-Decoders/tree/master/Aqua%20Decoder">decoding</a> and <a href="https://github.com/altillimity/X-Band-Decoders/tree/master/Aqua%20MODIS%20Extractor">data separation</a> utilities for the <a href="https://lpdaac.usgs.gov/data/get-started-data/collection-overview/missions/modis-overview/">MODIS</a> data (which was all I needed to produce the simple true color image I was going for).</p>

<p>After waiting on a few dependencies to build, these tools worked on the first try, yielding a stream of uncorrupted MODIS image data frames. Nice!</p>

<h1 id="rendering-an-image">Rendering an Image</h1>

<p>The X-Band-Decoders repo was once again helpful, pointing me to <a href="https://github.com/rocketscientist-fred/weathersat">weathersat</a>.
As the README in the repo says,</p>

<blockquote>
  <p>If you don’t read this README with attention, as well as the</p>

  <p>./hrpt.exe –help</p>

  <p>output, you will (!!) fail to successfully run the s/w. Especially the environment variables described below are crucial !!!!</p>
</blockquote>

<p>Promptly ignoring this, I spent an hour or so trying to get it to work, to no avail.</p>

<p>Going back and looking at the output of <code class="highlighter-rouge">--help</code> though, there’s a nice example of how to use the utility to render a real color image from MODIS data - exactly what I wanted!
After stumbling over a couple last things (spaces in directory names breaking stuff and a missing trailing <code class="highlighter-rouge">/</code> in the necessary envvars), I had an image:</p>

<p><img src="/images/azure-orbital/AQ_MODIS-2022-03-19-2100-bandxx_ch11_correct.jpg" alt="2022-03-19 21:00 MODIS" /></p>

<p>Since this process had taken a few days to perfect, I also had another capture ready to process by the time it worked. Running it through resulted in another pretty decent image:</p>

<p><img src="/images/azure-orbital/AQ_MODIS-2022-03-21-2056-bandxx_ch11_correct.jpg" alt="2022-03-21 20:56 MODIS" /></p>

<p>(For reference on what you’re seeing geographically speaking, the peninsula visible at the bottom of the images is Baja California)</p>

<p>Success!</p>

<h1 id="other-captures">Other Captures</h1>

<p>I received data for a total of five AQUA passes (weighing in at ~100GB total!), however only two of them had usable data.
I’m sure there’s more tweaking that could be done in the demod/decode steps which would probably yield more usable frames, but even the images I produced above have significant bands of little to no reception.
As the images show, it was somewhat cloudy over the datacenter the antenna was located at (and these captures were all done within a few days of each other), so perhaps the weather was interfering?
Given that these are relatively high frequency signals (8.16GHz), I <em>think</em> atmospheric conditions could have an effect…</p>

<p>Either way, I got a couple cool images so I was very much content :)</p>

<h1 id="conclusion">Conclusion</h1>

<p>This entire experiment occurred over the span of about two months from first requesting access to getting images out, but out of that span I only spent about three days actively working on it. Surprisingly quick for such a project I think!</p>

<p>All in all, it was quite a fun thing to spend some time on. I learned quite a bit more about software defined radio (hopefully you did as well!) and more than I would have ever liked to about VITA-49.</p>

<p>As always, feel free to reach out with any questions or feedback. I also still have the raw capture data if anyone would like a copy of it to experiment with the demod/decode/render steps themselves.</p>

<h1 id="appendix-orbital-qa">Appendix: Orbital Q&amp;A</h1>

<blockquote>
  <p>Q: What is the Gain/Temperature field? As far as I know this is normally a characteristic of the receiving system, not a tunable parameter?</p>

  <p>A: The G/T field is a requirement passed by the user to the system. So you aren’t setting the G/T but rather requesting a min G/T spec. This is because Orbital integrates across many first party and partner sites with various antenna sizes. So if you needed a certain bar of performance when you query availability from site to site you can have that guarantee by specifying whatever G/T your link needs. We are not filtering on this yet in the near-term so feel free to put a placeholder value here.</p>
</blockquote>

<blockquote>
  <p>Q: What is the format for the demodulation and decoding configuration?</p>

  <p>A: The argument is an unvalidated blob or string type that is a copy/paste of the modem config file. Right now we offer Kratos modems in this mode.</p>
</blockquote>

<blockquote>
  <p>Q: How is the data received actually encoded?</p>

  <p>A: Azure Orbital leverages DIFI for its RF transport layer. Those details can be downloaded here at https://dificonsortium.org/, and Microsoft had a significant hand in the creation of this consortium. To that effect, our SDR team has released the GNU Radio Azure Software Radio toolbox publicly available on GitHub. This lets you interface directly with Orbital in GNU Radio without any need for manual coding or modding! All you have to do is specify your VM with this toolbox loaded as the endpoint. Check it out here: https://github.com/microsoft/azure-software-radio</p>
</blockquote>

<p>N.B. The Azure Software Radio toolbox only supports reading data from a socket (doing all of the processing as it’s streaming in), or from a Azure blob storage file.</p>]]></content><author><name>Nick Gregory</name></author><category term="meteorology" /><summary type="html"><![CDATA[If you follow AWS closely, you may have heard about a niche product launch a few years back called Ground Station which lets you rent, well, a ground station (basically a big antenna plus supporting equipment to communicate with satellites). A friend recently linked me an AWS blog post with a sample use case which described using it as a way of receiving real time imagery from orbiting weather satellites. Now funny enough, receiving data from polar orbiting weather satellites has been a side project of mine for over a decade now, but living in NYC has put a bit of a hold on it. I used to have a home-built QFH antenna which I used to receive images with a surprisingly high success rate given the janky construction of it.]]></summary></entry><entry><title type="html">The Discovery and Exploitation of CVE-2022-25636</title><link href="https://www.nickgregory.me/post/2022/03/12/cve-2022-25636/" rel="alternate" type="text/html" title="The Discovery and Exploitation of CVE-2022-25636" /><published>2022-03-12T00:00:00+00:00</published><updated>2022-03-12T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2022/03/12/cve-2022-25636</id><content type="html" xml:base="https://www.nickgregory.me/post/2022/03/12/cve-2022-25636/"><![CDATA[<p>A few weeks ago, I found and reported CVE-2022-25636 - a heap out of bounds write in the Linux kernel.
The bug is exploitable to achieve kernel code execution (via ROP), giving full local privilege escalation, container escape, whatever you want.</p>

<p>In this post, I cover the entire process of finding and exploiting the bug (to as much of an extent as I did at least) - from initial “huh that looks weird” to a working LPE.</p>

<p>It’s a long post, but hopefully this will be useful to others (especially those newer to kernel exploitation) to get a feel for what my process was like.</p>

<p>Finally, if you’re just here for the exploit details and don’t want the backstory of me discovering it, feel free to <a href="#exploitation">skip ahead</a>.</p>

<h1 id="bug-hunting">Bug Hunting</h1>

<p>One night a few weeks back, I was bored. There were a few other projects I could have worked on,
but none of them seemed particularly interesting, so I decided to do some random (kernel) code review.
There have been a few notable bugs in the netfilter kernel subsystem that I’ve seen over the past few years
(perhaps most notably <a href="https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html">CVE-2021-22555</a>), so I decided to start looking there.
It’s a relatively complex subsystem that’s widely available - the perfect target.</p>

<h2 id="aside-what-is-netfilter">Aside: What is netfilter?</h2>

<p>Netfilter, as the <a href="https://www.netfilter.org/">project’s website</a> says, “enables packet filtering, network address [and port] translation (NA[P]T), packet logging, userspace packet queueing and other packet mangling.”
You’ve probably interacted with netfilter before without knowing about it!
Ever used <code class="highlighter-rouge">iptables</code> to block inbound traffic on a server, or configured a Linux box as a router with NAT?
All of that packet processing is done in the kernel by netfilter.</p>

<p>I’ve done a bunch of stuff with <code class="highlighter-rouge">iptables</code> in the past, but other than that I wasn’t familiar with anything else netfilter provided (and definitely didn’t know anything about how it worked),
so I clicked around on some files in the subsystem’s <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter">main source directory</a> to try and get a lay of the land.</p>

<p>I started at the top by looking at a few of the (what seemed to be) protocol parsers. Parsing non-trivial data is always potentially error-prone, so it felt like a good place to start.
I ended up focusing on the parts of the code taking configuration input from userland (over a netlink socket), as while a bug in packet processing would be interesting,
the decoder would still have to be “activated” by some configuration from userland in the first place.</p>

<p><em>Editor’s note</em>: perhaps it’s worth taking another look at these since <a href="https://syzkaller.appspot.com/upstream">syzkaller</a> doesn’t show much of any coverage on these files so maybe there’s something lurking…</p>

<p>Anyways, after going through <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nf_conntrack_ftp.c"><code class="highlighter-rouge">nf_conntrack_ftp.c</code></a> and a few others without seeing much of interest,
I was scrolling through looking for other “types” of code and saw <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nf_dup_netdev.c"><code class="highlighter-rouge">nf_dup_netdev.c</code></a>.
I was actually just about to click on some other file when I saw that and thought “well maybe there could be some refcounting bug if something is duplicated” so I decided to look in there.</p>

<p>It’s quite a short file, but <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nf_dup_netdev.c#L67">line 67</a></p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>entry = &amp;flow-&gt;rule-&gt;action.entries[ctx-&gt;num_actions++];
</code></pre></div></div>

<p>stood out to me for two particular reasons:</p>

<ol>
  <li>It was incrementing <code class="highlighter-rouge">ctx-&gt;num_actions</code> and using it as the index into an array without any bounds checking</li>
  <li>The index (<code class="highlighter-rouge">ctx-&gt;num_actions</code>) and the array itself (<code class="highlighter-rouge">flow-&gt;rule-&gt;action.entries</code>) are struct members of two completely different variables, not obviously related. That is, the line is equivalent to <code class="highlighter-rouge">a-&gt;b[x-&gt;y]</code> which seems potentially more “suspicious” than <code class="highlighter-rouge">a-&gt;b[a-&gt;c]</code>.</li>
</ol>

<p>Neither of these reasons made this a definite bug (yet) of course, but the line definitely “smelled,” which prompted a bit more digging.</p>

<h1 id="is-it-a-bug">Is It a Bug?</h1>

<p>I had a few immediate questions:</p>

<ul>
  <li>What determines the size of the <code class="highlighter-rouge">action.entries</code> array?</li>
  <li>How is <code class="highlighter-rouge">nft_fwd_dup_netdev_offload</code> called? And what controls how many times it’s called?</li>
  <li>When/how is <code class="highlighter-rouge">ctx</code> initialized?</li>
</ul>

<p>At this point I also realized that this was in <code class="highlighter-rouge">nft_fwd_dup_netdev_</code><strong><code class="highlighter-rouge">offload</code></strong>. Even if this bug was real, it may only be reachable on systems with Network Interface Cards (NICs) with support for packet processing offload, which are very rare (and very expensive).
It would still be a bug, but maybe not the most interesting bug in the world.</p>

<p>Pulling up the x-refs of <code class="highlighter-rouge">nft_fwd_dup_netdev_offload</code> showed it was called in a <code class="highlighter-rouge">.offload</code> handler of the <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nft_dup_netdev.c#L67"><code class="highlighter-rouge">dup</code></a> and <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nft_fwd_netdev.c#L77"><code class="highlighter-rouge">fwd</code></a> <code class="highlighter-rouge">nft_expr_type</code>s.
Looking at the references for the <code class="highlighter-rouge">offload</code> struct member (which is really not a pleasant experience in Elixir…), I found <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nf_tables_offload.c#L125">this use</a> which answered all but one of the questions above:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ctx = kzalloc(sizeof(struct nft_offload_ctx), GFP_KERNEL);

...

while (nft_expr_more(rule, expr)) {
  if (!expr-&gt;ops-&gt;offload) {
    err = -EOPNOTSUPP;
    goto err_out;
  }
  err = expr-&gt;ops-&gt;offload(ctx, flow, expr);
  if (err &lt; 0)
    goto err_out;

  expr = nft_expr_next(expr);
}
</code></pre></div></div>

<ul>
  <li><strong>How is <code class="highlighter-rouge">nft_fwd_dup_netdev_offload</code> called?</strong>: It’s indirectly called as part of <code class="highlighter-rouge">nft_flow_rule_create</code>.</li>
  <li><strong>What controls how many times <code class="highlighter-rouge">nft_fwd_dup_netdev_offload</code> is called?</strong>: Offload handlers (and therefore <code class="highlighter-rouge">nft_fwd_dup_netdev_offload</code> for fwd/dup expressions) are called for every expression in the rule which has one. No other checks.</li>
  <li><strong>When/how is <code class="highlighter-rouge">ctx</code> initialized?</strong>: For each rule created, the context is zero-initialized and the same instance is passed to each offload handler.</li>
</ul>

<p>More importantly than all of those however, the answer to the most interesting question was just above:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>expr = nft_expr_first(rule);
while (nft_expr_more(rule, expr)) {
  if (expr-&gt;ops-&gt;offload_flags &amp; NFT_OFFLOAD_F_ACTION)
    num_actions++;

  expr = nft_expr_next(expr);
}

...

flow = nft_flow_rule_alloc(num_actions);
</code></pre></div></div>

<p>We see that for each expression in the rule, a <code class="highlighter-rouge">num_actions</code> counter is incremented <em>only when the expression has a certain bit (<code class="highlighter-rouge">NFT_OFFLOAD_F_ACTION</code>) set</em> in <code class="highlighter-rouge">ops-&gt;offload_flags</code>.
Quickly checking back at the definition for the <code class="highlighter-rouge">dup</code> and <code class="highlighter-rouge">fwd</code> expressions, neither of them have <code class="highlighter-rouge">NFT_OFFLOAD_F_ACTION</code> set.
In fact, there’s only one use of <code class="highlighter-rouge">NFT_OFFLOAD_F_ACTION</code> at all: in the <code class="highlighter-rouge">immediate</code> expression type (<a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nft_immediate.c#L227">here</a>).</p>

<p>At this point I was pretty confident there was a bug.
As far as I could tell, the only thing that could prevent it would be if there was some enforcement of having one immediate per dup/fwd rule.</p>

<h2 id="checking-for-exploitability">Checking for Exploitability</h2>

<p>Unfamiliar with how to “talk” to nftables, I searched around for some examples of what an nftables table/chain definition looks like and how to install one.
<a href="https://www.spinics.net/lists/netfilter/msg59251.html">One mailing list post</a> was particularly useful as it had everything needed, including how to set the <code class="highlighter-rouge">offload</code> flag which is required to reach the bug (because of <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/netfilter/nf_tables_api.c#L3423">this</a> check).</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>table netdev filter_test {
  chain ingress {
    type filter hook ingress device eth0 priority 0; flags offload;
    ip daddr 192.168.0.10 tcp dport 22 drop
  }
}
</code></pre></div></div>

<p>With that sample in hand, I started playing around with nftables to see if/how the bug could be hit.</p>

<p>First, I set up a kprobe on <code class="highlighter-rouge">flow_rule_alloc</code> (responsible for creating our <code class="highlighter-rouge">action.entries</code> array) with a fetcharg to show the <code class="highlighter-rouge">num_actions</code> argument: <code class="highlighter-rouge">sudo kprobe-perf -F 'p:flow_rule_alloc num_actions=%di:u32'</code>.
This immediately failed because (at least on Ubuntu) nftables is a lazily loaded kernel module, so the code wasn’t actually loaded yet. Oops.
After quickly running <code class="highlighter-rouge">nft -f mailing_list.nft</code> (which forced the kernel module to load even though the command itself failed), I could actually set the kprobe.</p>

<p>Running <code class="highlighter-rouge">nft -f mailing_list.nft</code> for real this time resulted in a kprobe hit (despite the rule installation failing):</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo nft -f mailing_list.nft
a.nf:1:1-2: Error: Could not process rule: Operation not supported
table netdev x {
^^
</code></pre></div></div>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo kprobe-perf 'p:flow_rule_alloc num_actions=%di:u32'
Tracing kprobe flow_rule_alloc. Ctrl-C to end.
             nft-20137   [001] .... 1573655.306178: flow_rule_alloc: (flow_rule_alloc+0x0/0x60) num_actions=1
</code></pre></div></div>

<p>So <code class="highlighter-rouge">flow_rule_alloc</code> was indeed being hit even though the VM I was testing in definitely didn’t have a network device capable of hardware offload!
The system didn’t crash or anything so it seemed like the buggy behavior wasn’t getting hit yet, but this was good progress.</p>

<p>And it was at this point that I realized I had never changed the example from the mailing list to actually include a <code class="highlighter-rouge">dup</code> expression. Oops again.
After changing the rule to <code class="highlighter-rouge">ip daddr 192.168.0.10 dup to eth0</code> though, my system annoyingly remained in a non-<code class="highlighter-rouge">panic</code>d state.</p>

<p>Before continuing, I also wanted to try running the <code class="highlighter-rouge">nft</code> commands after <code class="highlighter-rouge">unshare</code>ing into a new user and network namespace (<code class="highlighter-rouge">unshare -Urn</code>) to see if it was possible to reach this as an unprivileged user. Sure enough it was, making this bug potentially even more potent.</p>

<p>Back to the bug itself though: poking around through the <code class="highlighter-rouge">nft</code> man pages, I found you could pass <code class="highlighter-rouge">-d netlink</code> which ended up being incredibly useful as it showed the “disassembly” of the rule that was being sent to the kernel:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[ meta load protocol =&gt; reg 1 ]
[ cmp eq reg 1 0x00000008 ]
[ payload load 4b @ network header + 16 =&gt; reg 1 ]
[ cmp eq reg 1 0x0a00a8c0 ]
[ immediate reg 1 0x00000001 ]
[ dup sreg_dev 1 ]
</code></pre></div></div>

<p>From this, it’s apparent why the bug wasn’t being triggered: the CLI generates an immediate expression before the <code class="highlighter-rouge">dup</code> (representing the device the packet should be duplicated to), so the accounting was “working”.
Is it possible to have a <code class="highlighter-rouge">dup</code> without a preceding <code class="highlighter-rouge">immediate</code>?
I couldn’t find a way to have the CLI install a rule from this disassembled format (so couldn’t force it to generate <code class="highlighter-rouge">dup</code>s with no <code class="highlighter-rouge">immediate</code>s),
so it was time to go deeper and manually create the packets to send to the subsystem.</p>

<h3 id="golang-implementation">Golang Implementation</h3>

<p>I have a love/hate relationship with Go, but that’s a blog for another time.
At the end of the day, it’s basically the only language that has a large community (and therefore a large selection of libraries) that’s low enough level to do what I need for this,
but also high enough level to not make me want to throw my computer out the window while I’m trying to get something to work.
So I started building a proof of concept in Go.</p>

<p>Conveniently, Google has a go <code class="highlighter-rouge">nftables</code> <a href="https://github.com/google/nftables">library</a> which looked like a good starting point since I’d be able to manually construct the rule.
Unfortunately, it didn’t expose quite everything I needed (mainly around setting the offload flag) and by the time I discovered this, I was a few hours into building around it and really didn’t want to rewrite it in C.
I cobbled together some truly awful code which used reflection to overwrite the private array of messages to send, manually constructed the necessary chain creation message with the proper bit flipped, etc. etc., and another hour or so later I was back to where I started with the <code class="highlighter-rouge">nft</code> CLI.</p>

<p>I added another <code class="highlighter-rouge">dup</code> without an <code class="highlighter-rouge">immediate</code> before it, ran it and…</p>

<p>…</p>

<p>not much happened. It errored out with the normal “operation not permitted”, but nothing else. So at least it didn’t get rejected because of missing immediates, which was good, I guess?</p>

<p>Then, a few seconds later, kaboom. The kernel panicked and my shell was dead. We have a bug!</p>

<p>Now comes the fun part.</p>

<h1 id="exploitation">Exploitation</h1>

<p>Analyzing what our bug actually provides us (with the help of <code class="highlighter-rouge">pahole</code> to get struct offsets), we see that there are 2 out of bounds writes:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>entry = &amp;flow-&gt;rule-&gt;action.entries[ctx-&gt;num_actions++];
entry-&gt;id = id;
entry-&gt;dev = dev;
</code></pre></div></div>

<ol>
  <li>The write of <a href="https://elixir.bootlin.com/linux/v5.16.11/source/include/net/flow_offload.h#L199"><code class="highlighter-rouge">enum flow_action_id id</code></a> immediately after the end of the array, writing the value 4 or 5 (depending on whether this is a <code class="highlighter-rouge">fwd</code> or <code class="highlighter-rouge">dup</code> expression)</li>
  <li>The write of <a href="https://elixir.bootlin.com/linux/v5.16.11/source/include/net/flow_offload.h#L205"><code class="highlighter-rouge">struct net_device *dev</code></a> 24 bytes past the end of the array</li>
</ol>

<p>As for the sizes of everything (on my Ubuntu test VM with a 5.13 kernel), the base <code class="highlighter-rouge">flow_rule</code> structure is 32 bytes and each additional <code class="highlighter-rouge">entry</code> in the array is 80 bytes. This means:</p>

<ul>
  <li>If there are no immediates in our rule, the rule allocation will be 32 bytes, landing in the kmalloc-32 slab</li>
  <li>One immediate gives an allocation of size 112, landing in the kmalloc-128 slab</li>
  <li>Two immediates give an allocation of size 192, landing in the kmalloc-192 slab</li>
  <li>and so on</li>
</ul>

<p>Focusing on the <code class="highlighter-rouge">dev</code> pointer write, the above allocation sizes mean that the write will land either at offset 24 of the next 32- or 192-slab allocation, or at offset 8 of the next 128-slab allocation.
I manually hunted around through <code class="highlighter-rouge">pahole</code>’s output looking for any interesting structure which had a pointer at the necessary offset, but came up empty handed.
Everything that I found was either in a subsystem that required elevated privileges to access, in a subsystem that is “exotic” (probably not easily reachable), or in a subsystem which I felt was too flaky to try and land in (e.g. the scheduler).</p>

<p>Long story short, I put this aside and came back to it a couple days later with fresh eyes.</p>

<p>While reading through Alexander Popov’s writeup of <a href="https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html">another recent kernel bug</a> looking for inspiration the thought occurred to me:
we have the ability to cause <strong>multiple</strong> of these out of bounds writes, not just one (since multiple <code class="highlighter-rouge">dup</code>s can be put in a rule).
So in addition to hitting offset 8 of the next 128-slab allocation, we could also hit offset 88 of that allocation,
or offset 40 of the 2nd next allocation, or offset 120 of the 2nd next, or…</p>

<p>Having just read that writeup in which Alexander uses the security pointer (<strong>at offset 40</strong>) to land a <code class="highlighter-rouge">kfree</code>, the exploit path became obvious.</p>

<p>What we do is:</p>
<ul>
  <li>Spray a bunch of System V message queue messages, causing the kernel to allocate a lot of <code class="highlighter-rouge">msg_msg</code> structures of a controlled size. For now, we care about landing in the kmalloc-128 slab</li>
  <li>Free some of them</li>
  <li>Add the netlink rule, causing the <code class="highlighter-rouge">flow_rule</code> allocation to hopefully land in one of the just-free’d heap slots</li>
  <li>Do our OOB write a total of 3 times (i.e. have 3 <code class="highlighter-rouge">dup</code>s in our rule with no <code class="highlighter-rouge">immediate</code>), clobbering
    <ul>
      <li>The <code class="highlighter-rouge">list_head.prev</code> pointer (offset 8) of the next message on the heap</li>
      <li>Some random data (offset 88) in the contents of the next message on the heap</li>
      <li>The <code class="highlighter-rouge">security</code> pointer (offset 40) of the 2nd next message on the heap</li>
    </ul>
  </li>
  <li>Find and <code class="highlighter-rouge">msgrcv</code> the 2nd next message, causing the kernel to <code class="highlighter-rouge">kfree()</code> the <code class="highlighter-rouge">net_device</code> (since it was a <code class="highlighter-rouge">net_device</code> pointer that was written)</li>
  <li>Allocate some more messages, but this time in the kmalloc-4k slab with the goal of landing in the <code class="highlighter-rouge">net_device</code> that was just free’d</li>
  <li>Cause the kernel to do something on the device which would cause a function pointer in the (now controlled) <code class="highlighter-rouge">net_device.netdev_ops</code> operations struct to be called, giving us code execution. Reading from <code class="highlighter-rouge">/proc/net/dev</code> is a simple answer to this (causing <code class="highlighter-rouge">netdev_ops-&gt;ndo_get_stats64</code> to <a href="https://elixir.bootlin.com/linux/v5.16.11/source/net/core/dev.c#L10697">be called</a>) which is what I ended up using.</li>
</ul>

<p>This chain is <em>incredibly</em> nice. Just to highlight a few benefits:</p>

<ul>
  <li>We know exactly which <code class="highlighter-rouge">msg_msg</code> had its <code class="highlighter-rouge">list_head.prev</code> pointer clobbered (and is therefore unsafe to free) since we can <code class="highlighter-rouge">MSG_COPY</code> it out of the queue (which won’t touch the next/prev pointers since it’s not actually removed) and look to see if the contents of the message have changed.</li>
  <li>In addition to telling us which message is “dangerous”, this also leaks the kernel heap pointer that we’re going to be landing in, making it trivial to start ROPing (more on this <a href="#sidenote-rop">below</a>).</li>
  <li>We also know exactly which message had its <code class="highlighter-rouge">security</code> pointer overwritten. We could either add a 4th <code class="highlighter-rouge">dup</code> (and again look at message data after copying it), or we can look at the message’s <code class="highlighter-rouge">mtype</code> after it’s copied out. Remember how 2 things are written out of bounds (4 or 5, and the pointer)? It just so happens that the 4 or 5 gets written over the message’s <code class="highlighter-rouge">mtype</code> (offset 16), so by checking if the <code class="highlighter-rouge">mtype</code> changed from whatever value was put in, we can tell if we have the right message.</li>
</ul>

<p>By the end of the night (perhaps staying up a <em>bit</em> too late…), I had the first working proof of concept for this (in an ARM VM not x86, hence the different registers and whatnot).</p>

<p><img src="/images/cve-2022-25636/Screen%20Shot%202022-02-15%20at%2002.28.17.png" alt="A panic!" /></p>

<p>Success!</p>

<p>A few more hours of hacking on this though, and I hadn’t gotten much closer to code execution.</p>

<p>For some reason, the exploit was incredibly flaky (i.e. it had a very low success rate). I figured this was due to one (or more) of the following:</p>

<ol>
  <li>The kernel <a href="https://mxatone.medium.com/randomizing-the-linux-kernel-heap-freelists-b899bb99c767">freelist randomization</a> was more effective against this technique than I expected</li>
  <li>All of the work the Go runtime does in the background was messing with the kernel heap</li>
  <li>Other things running on the system were causing sporadic <code class="highlighter-rouge">kmalloc-128</code> allocations, throwing off/using up the freelist</li>
</ol>

<p>I tried changing everything to work out of the <code class="highlighter-rouge">kmalloc-2048</code> slab (since all of the offset math still works out), but this didn’t seem to help at all.
At this point I probably should have spent some time with a kernel debugger, tracing exactly what was happening with the freelist, but I decided to go ahead and rewrite the exploit in C to see if that would help.
If nothing else, it’d probably make later stages of the exploit much easier to work with, since I wouldn’t have to try to link in some other thing that the kernel could jump to as a final stage of the exploit.</p>

<h2 id="rewriting">Rewriting</h2>

<p>Boy was this a nightmare. There <em>is</em> a C library for “nicely” working with nftables, however at the end of the day it’s C so nothing is really “nice.”
After many hours of staring at <code class="highlighter-rouge">strace</code> output of the netlink packets, trying to figure out what I was missing in the C code, I eventually got back to where I was with the Go version.
If you’re interested, the code necessary to interface with nftables is available in the <a href="https://www.openwall.com/lists/oss-security/2022/02/21/2">reproducer</a> I posted to the oss-security mailing list.</p>

<p>But it wasn’t any more stable. Damn.</p>

<p>After another couple of days of messing around (mainly trying to figure out if there was a specific order in which to free the initial messages to best get around freelist randomization), I got to a point where the exploit was ~30% reliable, which was good enough to proceed with.
It’s entirely possible I was missing something obviously broken in my exploit code, but if you have any ideas about something I could be missing kernel-side, please do drop me an email or DM - I would really like to know what’s going on.</p>

<p>Having spent enough time on this already, I decided to forgo making this into a full exploit. I just wanted to get my root shell and call it a day.
I disabled SMEP, SMAP, KPTI, and KASLR on my test VM, and put together a quick “callback” (getting me root and out of any container/namespace) which I could jump directly to from the kernel:</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Read the current task_struct pointer out of the per-CPU area
// (the gs-relative offset is specific to this kernel build)
void *get_task(void) {
    void *task;
    asm volatile ("movq %%gs:0x1fbc0, %0":"=r"(task));
    return task;
}

// prepare_kernel_cred, commit_creds, switch_task_namespaces, and
// init_nsproxy are hardcoded kernel addresses (KASLR is disabled)
void *elevate(void *dev, void *storage) {
    void *c = ((void * (*)(int))(prepare_kernel_cred))(0);  // root creds
    ((void (*)(void *))(commit_creds))(c);                  // apply to this task
    void *current = get_task();
    // swap back to the init namespaces, escaping any container
    ((void (*)(void *, void *))(switch_task_namespaces))(current, (void *)init_nsproxy);
    return NULL;
}
</code></pre></div></div>

<p>And that’s basically it. Minus the whole “it only works 30% of the time,” the exploit was done, and I got my shell after a few attempts.</p>

<p><img src="/images/cve-2022-25636/Screenshot%20from%202022-03-08%2023-28-57.png" alt="root" /></p>

<p>And before you go burning cycles trying to crack that password hash, it’s just <code class="highlighter-rouge">vagrant</code> :P</p>

<h2 id="sidenote-rop">Sidenote: ROP</h2>

<p>While I didn’t end up implementing it in my exploit, we’re in an amazing position to ROP (making SMEP/SMAP/KPTI a non-issue).
Since the kernel heap address of the <code class="highlighter-rouge">net_device</code> is leaked, we know where our message data is going to be in memory.
That pointer can then be used to compute an address for our fake <code class="highlighter-rouge">netdev_ops</code> (putting it somewhere else in our message),
and then when the kernel goes to call a function taken from that ops structure (with the <code class="highlighter-rouge">net_device</code> (/our message) as the first argument),
we can give it the address of a simple <code class="highlighter-rouge">mov rsp, rdi; ret</code> gadget to stack pivot on to our message.
From there, anything is possible.</p>

<p>The only thing missing is a KASLR leak, but that’s not much of a barrier :)</p>

<h1 id="code">Code?</h1>

<p>In the couple of weeks it took me to write up this blog post, <a href="https://twitter.com/Bonfee1/status/1500837241991618565">@Bonfee</a> already independently developed an exploit for the bug and published it!</p>

<p>I haven’t looked through the entirety of their implementation, but it seems to use a similar path to what I describe above. However, it also includes a full ROP chain and KASLR leak making it far more complete than mine. I’d recommend you check it out! <a href="https://github.com/Bonfee/CVE-2022-25636">https://github.com/Bonfee/CVE-2022-25636</a></p>

<h1 id="wrapping-up">Wrapping Up</h1>

<p>This was a really fun bug to discover and work on. From start to end, it took just under a week to find and triage the bug, figure out how to hit it, and build the exploit.
While not novel, the OOB write primitive we get with it is also pretty interesting, and makes for quite a clean exploit as we’ve seen.</p>

<p>I hope you’ve enjoyed reading, and of course reach out with any questions you may have.</p>]]></content><author><name>Nick Gregory</name></author><category term="linux" /><category term="security" /><summary type="html"><![CDATA[A few weeks ago, I found and reported CVE-2022-25636 - a heap out of bounds write in the Linux kernel. The bug is exploitable to achieve kernel code execution (via ROP), giving full local privilege escalation, container escape, whatever you want.]]></summary></entry><entry><title type="html">A snapshotting kernel module for fuzzing</title><link href="https://www.nickgregory.me/post/2021/12/10/afl-kmod/" rel="alternate" type="text/html" title="A snapshotting kernel module for fuzzing" /><published>2021-12-10T00:00:00+00:00</published><updated>2021-12-10T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2021/12/10/afl-kmod</id><content type="html" xml:base="https://www.nickgregory.me/post/2021/12/10/afl-kmod/"><![CDATA[<p>Right as the pandemic was starting in March/April 2020, I spent a couple of weekends writing a Loadable Kernel Module (LKM) for Linux,
designed to add a syscall which could be used by a fuzzer to quickly restore program state instead of using a conventional fork/exec loop.
This was <a href="https://github.com/AFLplusplus/AFLplusplus/issues/248">originally suggested</a> on the AFL++ <a href="https://github.com/AFLplusplus/AFLplusplus/blob/stable/docs/ideas.md">Ideas page</a>, and it nicely intersected a bunch of stuff I’m familiar with so I wanted to take a crack at it.</p>

<p>My implementation can be found in the now archived GitHub repo: <a href="https://github.com/kallsyms/snapshot-lkm">https://github.com/kallsyms/snapshot-lkm</a>.
It’s deprecated in favor of <a href="https://github.com/AFLplusplus/AFL-Snapshot-LKM">the AFL++ version</a>; however, as of Dec. 2020 that has also been frozen, as it’s a significant amount of work to update the module for each kernel version - it requires hooking some internal kernel functions which change frequently.</p>

<h1 id="overview">Overview</h1>
<p>My initial work was heavily based on the original kernel patchset from the SSLab at Georgia Tech, which can be found <a href="https://github.com/sslab-gatech/perf-fuzz">here</a>.
I’d <strong>strongly</strong> recommend reading <a href="https://gts3.org/assets/papers/2017/xu:os-fuzz.pdf">the original paper</a> to understand more about how the innards of the snapshotting work.
To summarize though, the basic idea is to add a new syscall (<code class="highlighter-rouge">snapshot()</code>) which can either snapshot or restore the “important” bits of the current process so a new fuzz case can be run.
This avoids the excessive overhead of a normal <code class="highlighter-rouge">fork()</code>, giving a very nice speedup versus conventional fuzzers.</p>

<h1 id="development-process">Development Process</h1>

<h2 id="understanding-the-original-implementation">Understanding the Original Implementation</h2>

<p>The first thing I needed to do was actually extract a diff/patch of what the paper implemented.
The main repo is (unfortunately) a full fork of Linux, but squashed, so we can’t easily <code class="highlighter-rouge">git diff</code> to see what was implemented.
A quick non-git <code class="highlighter-rouge">diff</code> against a freshly-cloned Linux v4.8.10 repo quickly fixed that, giving us <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch">the patch</a>.</p>

<p>I was surprised at how small it was.</p>

<p>There were only a total of 4 files that had meaningful changes which would affect normal program flow.
The rest are either header files, syscall definitions, or the snapshot/restore implementation itself.</p>

<p>Breaking down each major function change:</p>

<ul>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L41"><code class="highlighter-rouge">file.c:dup_fd</code></a>: when a <code class="highlighter-rouge">files_struct</code> (basically the set of file descriptors opened by a task) is duplicated (e.g. in <code class="highlighter-rouge">fork()</code>), the newly created <code class="highlighter-rouge">files_struct</code> needs to have its snapshot metadata initialized.</li>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L281"><code class="highlighter-rouge">exit.c:do_group_exit</code></a>: when a task exits as part of the entire group going down, snapshot metadata needs to be cleaned up.</li>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L295"><code class="highlighter-rouge">exit.c:exit_group</code></a>: when a task calls <code class="highlighter-rouge">exit()</code>, the snapshot is implicitly restored.</li>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L315"><code class="highlighter-rouge">fork.c:dup_mm</code></a>: when a task’s memory mappings are duplicated, the new <code class="highlighter-rouge">mm_struct</code>’s snapshot metadata needs to be initialized.</li>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L970"><code class="highlighter-rouge">memory.c:do_wp_page</code></a>: when a page fault occurs (when writing to a copy-on-write page), the snapshotting code may have some work to do.</li>
  <li><a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L1035"><code class="highlighter-rouge">memory.c:do_anonymous_page</code></a>: when an anonymous (non-file-backed) page is accessed for the first time, the page that was mapped (added to a PTE) needs to be recorded, as the PTE may need to be restored.</li>
</ul>

<p>With this understanding of what’s needed to “inject” into the kernel, let’s talk a bit about how I went about doing that.</p>

<h2 id="hooking-kprobes">Hooking: Kprobes</h2>

<p>Linux has some crazy built-in tech that very few people know about. One of these is kernel probes, or kprobes.
Kprobes are a way for things (be it a superuser in userland using the tracefs interface, or another kernel module using the in-kernel API) to, well, probe the kernel.
You can set probe points on nearly any function in the kernel (even ones not EXPORTed for normal module use), and fetch values from the state at the time the probe is hit.
And if you’re using the kernel-land interface (i.e. from a module), you can even overwrite registers (including the instruction pointer!) when your callback fires.</p>

<p>Almost everything in the snapshot process could be written “out-of-band” of the normal kernel functions (meaning it’s just observing what the kernel is doing and tracking state outside of any normal kernel structures),
however in <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/perf-fuzz.patch#L1021">one place</a>, the modifications cause a function to return early.</p>

<p>There’s a neat trick you can do with kprobes to emulate this behavior: set the instruction pointer to a stub function which immediately returns.
Since that stub was never actually <code class="highlighter-rouge">call</code>ed (specifically, since no return instruction pointer was pushed to the stack), when that stub <code class="highlighter-rouge">ret</code>urns, it will pop off the return IP that the probed function should have returned to, effectively giving us a way to return early.
This will only work if the probe is on the very first instruction of a function (otherwise the stack may have been expanded by the probed function), but this will be the case for us so we’re set.
<a href="https://www.kernel.org/doc/Documentation/kprobes.txt#:~:text=If%20you%20change%20the%20instruction%20pointer">The docs</a> have a bit more detail about what you actually need to do to achieve this with the kprobe subsystem.</p>

<h2 id="hooking-syscall-table">Hooking: syscall table</h2>

<p>In addition to the purely-additive things we need to run when certain kernel functions are called, we also need to completely hijack the <code class="highlighter-rouge">exit</code> syscall and add a new syscall entirely to do our snapshotting.</p>

<p>Side note: as the AFL++ devs did in their version, the snapshot operation should probably have been implemented as an <code class="highlighter-rouge">ioctl</code> instead.
However, since I was treating this as a proof-of-concept and I already needed to do syscall table rewriting for <code class="highlighter-rouge">exit()</code> I figured I might as well do the same for <code class="highlighter-rouge">snapshot()</code>, and chose to overwrite the <code class="highlighter-rouge">tuxcall()</code> syscall since it’s completely unused.</p>

<p>Anyways, to get control over the syscalls, we need to overwrite the syscall table, which Linux uses to dispatch syscalls to their respective handlers.
If the kernel is “nice” and has the <code class="highlighter-rouge">sys_call_table</code> as a named symbol, we can use that.
In the case it doesn’t though, the quickest way I found to do this is to find where in kernel memory the address of the <code class="highlighter-rouge">read()</code> syscall handler is immediately followed by the address of the <code class="highlighter-rouge">write()</code> syscall handler, since those are the first two syscalls. This is implemented in <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/module.c#L84"><code class="highlighter-rouge">get_syscall_table</code></a>.</p>

<p>The only other thing we need to do to hook the syscall table is make that memory writable before overwriting it. To do that, I decided to temporarily disable the write-protect bit (bit 16) in CR0 instead of messing around with properly remapping the memory as R/W. Again, proof-of-concept code :)</p>

<h1 id="implementation">Implementation</h1>

<p>Now, with all of that out of the way, let’s do a quick overview of the module implementation.</p>

<p>Starting at the (logical) top, in <code class="highlighter-rouge">mod_init</code> we grab the address of the syscall table, flip the WP bit in cr0, save the existing handlers, and overwrite the handler pointers with our own.</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void **syscall_table = get_syscall_table();
...
_write_cr0(read_cr0() &amp; (~(1 &lt;&lt; 16)));
orig_sct_snapshot_entry = syscall_table[__NR_snapshot];
orig_sct_exit_group = syscall_table[__NR_exit_group];
syscall_table[__NR_snapshot] = &amp;sys_snapshot;
syscall_table[__NR_exit_group] = &amp;sys_exit_group;
_write_cr0(read_cr0() | (1 &lt;&lt; 16));
</code></pre></div></div>

<p>Next, we hook the two functions we need (<code class="highlighter-rouge">do_wp_page</code> and <code class="highlighter-rouge">page_add_new_anon_rmap</code>) with their respective handlers.
This uses a <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/hook.c#L15">small wrapper I wrote</a> which keeps track of all registered hooks so that we can cleanly tear them all down when the module unloads.</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>if (!try_hook("do_wp_page", &amp;wp_page_hook))
...
if (!try_hook("page_add_new_anon_rmap", &amp;do_anonymous_hook))
...
</code></pre></div></div>

<p>Lastly, we call into the main snapshotting code so it can do some initialization (just grabbing some addresses out of kallsyms).</p>

<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>return snapshot_initialize_k_funcs();
</code></pre></div></div>

<p>At this point, we’re initialized, our hooks are installed, and we’re ready for a “snapshot syscall aware” program to run.</p>

<p>From this point down, there’s really very little that was changed from the original patchset.</p>

<p>The only exceptions are:</p>

<ul>
  <li>When that program calls our snapshot syscall, it hits <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/module.c#L35">the handler</a> which in turn dispatches either <code class="highlighter-rouge">make_snapshot</code> or <code class="highlighter-rouge">recover_snapshot</code>. Those functions are (IIRC) completely unmodified from the original patchset.</li>
  <li>The <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/snapshot.c#L777-L907">hooks</a> need to read out of the <code class="highlighter-rouge">pt_regs</code> passed in to grab the arguments that were actually passed to the hooked function (<a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/snapshot.c#L876-L877">example</a>).</li>
  <li>The one place which requires us to return early <a href="https://github.com/kallsyms/snapshot-lkm/blob/master/snapshot/snapshot.c#L863">overwrites the instruction pointer to a stub function</a> as described above.</li>
</ul>

<h1 id="wrapping-things-up">Wrapping Things Up</h1>

<p>When I originally wrote back to the AFL++ maintainers about this, my implementation did “work”, but only for a few seconds before the kernel would oops.
I suspected there was some locking that needed to happen that I wasn’t doing (because it’s <em>always</em> locking bugs), but I went ahead and passed this on to them, laying the groundwork for their <a href="https://github.com/AFLplusplus/AFL-Snapshot-LKM">(much improved) implementation</a>.
With that version working well, they were able to achieve a &gt;3x speedup in certain target programs, which (if this were a more maintainable strategy) would be a great improvement.
As they note in the README however, “due to syscall hooking and the never ending changes in the kernel we are unable to maintain it as we are busy working on libafl.”</p>

<p>Despite not being adopted, this was a very fun project to work on at the end of the day and a strategy that I feel like could be useful to other applications that need to make light modifications to the kernel.</p>]]></content><author><name>Nick Gregory</name></author><category term="security" /><summary type="html"><![CDATA[Right as the pandemic was starting in March/April 2020, I spent a couple of weekends writing a Loadable Kernel Module (LKM) for Linux, designed to add a syscall which could be used by a fuzzer to quickly restore program state instead of using a conventional fork/exec loop. This was originally suggested on the AFL++ Ideas page, and it nicely intersected a bunch of stuff I’m familiar with so I wanted to take a crack at it.]]></summary></entry><entry><title type="html">DIY Environmental Monitor</title><link href="https://www.nickgregory.me/post/2021/12/10/environmental-monitor/" rel="alternate" type="text/html" title="DIY Environmental Monitor" /><published>2021-12-10T00:00:00+00:00</published><updated>2021-12-10T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2021/12/10/environmental-monitor</id><content type="html" xml:base="https://www.nickgregory.me/post/2021/12/10/environmental-monitor/"><![CDATA[<p>Early on in the pandemic, there was a good amount of discussion on Twitter about indoor CO2 levels as more people were spending time exclusively at home, often in a single, small room for hours on end.
Since I was one of those people spending nearly the entire day in a single room, I decided to look around for a CO2 monitoring system.
While a simple “alert after levels rise above x ppm” is sufficient, I was really looking for one that would be able to log data to a remote system so that I could monitor it throughout the day and/or look back on historical data from any computer.
After being thoroughly disappointed with what was on Amazon (nothing at a reasonable price point seemed to be able to send data to a remote server), I decided it would be a nice little project to build my own.</p>

<h2 id="parts">Parts</h2>

<p>Looking around on Adafruit, I settled on the <a href="https://www.adafruit.com/product/4867">SCD-30</a> - a combined CO2, temperature, and humidity sensor.
As the Adafruit page said, this is a <a href="https://en.wikipedia.org/wiki/Nondispersive_infrared_sensor">NDIR sensor</a> so while it’s not the cheapest ($59), it’s actually measuring the CO2 in the air instead of approximating it from the concentration of volatile organic compounds (VOCs).</p>

<p>As for the “brains” of the device, I went with an <a href="https://www.adafruit.com/product/3269">ESP32-based development board</a>.
This gave me nice headers for all of the pins I’d need to get at (much like an Arduino or Raspberry Pi), but also includes WiFi out of the box so I could put it basically anywhere and not have to figure out how to get network to it.</p>

<p>I was also sure to grab the requisite <a href="https://www.adafruit.com/product/4209">cable</a> to connect the sensor to the dev board.</p>

<p>Finally, as a last-minute addition, I grabbed a cheap light sensor (the <a href="https://www.adafruit.com/product/4162">VEML7700</a>) since I figured that would also be fun to have logged. From the diagrams, it looked like it could be stacked directly on top of the SCD-30 with just some headers connecting the I2C and power pins, requiring no other changes.</p>

<h2 id="assembly">Assembly</h2>

<p>With everything in hand, the assembly was simple.
Just as planned, I was able to solder the 0.1” headers included with the light sensor to connect GND and the I2C SCL/SDA pins between the SCD-30 and the VEML7700.</p>

<p>But where does power for the VEML come from, you may ask?
Well, it’s a hack, but the <a href="https://learn.adafruit.com/assets/73775">VEML board schematic</a> shows that the 3.3V “output” pin is a direct connection to the 3.3V “plane” of that board (including Vin for the sensor itself), so in theory it’s safe to feed 3.3V from the SCD-30 <em>into</em> the VEML7700’s 3.3V “out” and ignore the voltage regulator on the VEML entirely.
With that also bridged with a header, all that was left was to connect the STEMMA cable from the SCD-30 to the ESP32 board and start writing code.</p>

<h2 id="firmware">Firmware</h2>

<p>After grabbing libraries and sample code from Adafruit for each of the sensors to make sure they were working, getting a basic “logger” working over serial output was trivial.
As I mentioned above, the ESP32 has WiFi built in though, so I decided to use that to connect to an InfluxDB instance (just on Influx’s free cloud plan right now) and log everything there.
Some more munging of sample code later, and I had basically the entire thing ready to go.</p>

<p>I taped it down on the side of a shelf (which should expose it only to indirect light, for more accurate measurements) and let it run for a night.</p>

<p><img src="/images/environmental_sensor.jpeg" alt="The monitor all put together" /></p>

<p>However, when I checked the data the next day I found the readings were pretty far off.</p>

<p>At this point it was still July so I had the windows open most of the day which meant both that the temperature should be almost identical to what’s measured outside, and that the CO2 concentration should be ~400ppm.
The temperature I was logging was nearly 2.5degC (~4.5degF) too high, and due to how the NDIR sensor works, that discrepancy was also affecting the CO2 reading.
It looks like I’m <a href="https://forum.arduino.cc/t/scd30-on-esp32-wrong-temp-hum-calibration-issue/679237">not the only one with this issue</a>, but either way the fix was quick - there’s a built-in temperature offset value that can be set from the ESP32.</p>

<p>Even after that was changed, the CO2 reading was still a bit off, so I made one last change to the firmware, adding a simple HTTP server. By hitting a specific route, I could remotely force a recalibration of the SCD-30 (basically, on recalibration the sensor assumes it’s measuring ambient outside air with a CO2 concentration of 400ppm and adjusts its internal offset as appropriate).</p>

<h2 id="conclusion">Conclusion</h2>

<p>The materials list and the full source for the firmware is available at <a href="https://github.com/kallsyms/environmental_sensor">https://github.com/kallsyms/environmental_sensor</a>.</p>]]></content><author><name>Nick Gregory</name></author><category term="electronics" /><summary type="html"><![CDATA[Early on in the pandemic, there was a good amount of discussion on Twitter about indoor CO2 levels as more people were spending time exclusively at home, often in a single, small room for hours on end. Since I was one of those people spending nearly the entire day in a single room, I decided to look around for a CO2 monitoring system. While a simple “alert after levels rise above x ppm” is sufficient, I was really looking for one that would be able to log data to a remote system so that I could monitor it throughout the day and/or look back on historical data from any computer. After being thoroughly disappointed with what was on Amazon (nothing at a reasonable price point seemed to be able to send data to a remote server), I decided it would be a nice little project to build my own.]]></summary></entry><entry><title type="html">Overkilling Website Performance</title><link href="https://www.nickgregory.me/post/2019/11/19/overkilling-website-performance/" rel="alternate" type="text/html" title="Overkilling Website Performance" /><published>2019-11-19T00:00:00+00:00</published><updated>2019-11-19T00:00:00+00:00</updated><id>https://www.nickgregory.me/post/2019/11/19/overkilling-website-performance</id><content type="html" xml:base="https://www.nickgregory.me/post/2019/11/19/overkilling-website-performance/"><![CDATA[<p>Given the <a href="https://status.cloud.google.com/incident/cloud-datastore/19006">recent</a> <a href="https://status.cloud.google.com/incident/cloud-networking/19020">series</a> <a href="https://status.cloud.google.com/incident/storage/19002">of</a> <a href="https://status.cloud.google.com/incident/cloud-networking/19009">issues</a> with Google 
Cloud, I decided it was time to jump ship and look at other providers for this blog (and eventually the rest of my sites most likely).</p>

<h2 id="background">Background</h2>

<p>For the past year or so, I’ve been using GCP multi-region storage buckets with CloudFlare in front (for caching and TLS) to serve all of my static sites.
I was never thrilled with the TTFB numbers I was getting out of the combo on un-cached pages however, and GCP having four relatively major outages in 6 months just pushed it over the edge for me.</p>

<p>I did some initial tests in AWS with a single S3 bucket and Cloudfront, and while results were <em>slightly</em> better, they still were not fantastic - times to load pages not in Cloudfront’s edge caches were around 60-70ms (vs ~100ms for GCS).
Keep in mind however that those numbers are from a client in NYC with the site hosted in us-east-1 - one of the best cases possible (latency-wise).
Visitors on the other side of the US would be right back to ~100ms load times, and visitors outside of the Americas would easily be 150ms+.</p>

<p>Not having much better to do one night, I decided to figure out how I could completely overkill site performance.
My main concern was performance when cache misses happened (the majority of page loads on my site due to it not getting much traffic),
and my objective was to get ~60ms response times on any page (cached or not) from anywhere in the world, not just within the US.</p>

<p>Sticking with Amazon as a provider here, there’s a few options I started exploring:</p>

<h3 id="s3--cloudfront-with-extremely-large-ttls">S3 + Cloudfront with extremely large TTLs</h3>

<p>Based on <a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/HowCloudFrontWorks.html#CloudFrontRegionaledgecaches">the docs</a>, Cloudfront employs a two-layer cache.
POPs have their own, independent caches (very standard), however Cloudfront also has regional caches which, based on maps, seem to correspond to AWS regions.
If an item is not in the POP’s cache, it will reach back to the regional cache, which can then return the object from there or go back to the origin if necessary.
Since my site is not visited very frequently, it’s highly unlikely that any given page will be cached in the POP closest to the visitor (even with high TTLs), so I would be relying on regional caches keeping content basically indefinitely.
In theory this layout <em>could</em> work (assuming regional caches effectively never expire items that are within their TTL), however this is not guaranteed, and it also means I would have to explicitly invalidate a number of pages each time I make a change to the site.</p>

<h3 id="multi-region-s3--">Multi-region S3 + ???</h3>

<p>This was actually my first thought: just stick a copy of the site on each continent.
While the content replication is easy to do within S3, there don’t seem to be any ways to make origin decisions in Cloudfront based on geolocation.
The closest thing I found was to use a Lambda@Edge function to dynamically proxy the request based on geo, but then I got to thinking…
If I already need to have a function at the edge to determine where to proxy incoming requests, could I just have the function return the site itself?</p>

<p>This reminded me of <a href="https://blog.cloudflare.com/workers-sites/">a blog post by CloudFlare</a> which talks about deploying a static site to their edge using their Workers product (storing the site in their K/V store).
I was curious to see if I could do something similar on Amazon, mainly because Workers has a $5/mo minimum price which I’d rather not pay if I can avoid it.</p>

<h3 id="lambdaedge">Lambda@Edge</h3>

<p>Amazon has a vaguely similar product to Workers called Lambda@Edge which, after a bit of reading, seems to have a bit of a misleading name (in my opinion).
From what I can tell (based on docs and timing), the Lambda functions (at least for “Origin Request” triggered calls) are invoked in the nearest Amazon region, <em>not</em> at the POP/edge itself.
Either way, if I can easily get the site contents stored in every Amazon region, that definitely gets me very close to the goal of delivering uncached pages in ~60ms anywhere on Earth.</p>

<p>A bonus of Lambdas that I only realized later is that their timing characteristics end up coinciding nicely with visitor usage.
If a visitor comes along and is the first one in a while in the entire AWS region, it will take a hundred milliseconds or so for the Lambda function to start up, which, while not ideal, also isn’t the worst thing since DNS resolution, the initial TCP handshake, TLS, etc. will have also taken up a bit of time.
<em>Keep in mind this all only happens in the case that Cloudfront doesn’t have the page cached, either in the POP or in the regional cache, which certain pages (the home page, for instance) likely will be just due to background traffic.</em>
The interesting part about Lambdas is what happens on subsequent requests. Requests to other pages (like the visitor clicking on a blog entry) are unlikely to be cached (since Cloudfront didn’t have the prior page cached either), but we now have a warm Lambda instance that can serve requests in a couple of milliseconds.
So regardless of where the user is, they will either hit a cached page in Cloudfront (taking basically round-trip time to respond), or will be proxied to a warmed Lambda instance through Cloudfront.</p>
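<p>To make the serve-from-the-Lambda idea concrete, here’s a minimal sketch (page contents are placeholders): the site lives in a module-level dict loaded once per Lambda instance at cold start, so warm invocations answer straight from memory, and returning a response object instead of the request short-circuits any origin fetch.</p>

```python
# Hypothetical sketch of an "Origin Request" Lambda@Edge handler that
# generates the response itself instead of proxying to an origin.
# SITE would be bundled with the function; the contents here are placeholders.
SITE = {
    "/": ("<html>home</html>", "text/html"),
    "/post/example/": ("<html>post</html>", "text/html"),
}

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    page = SITE.get(request["uri"])
    if page is None:
        return {"status": "404", "statusDescription": "Not Found"}
    body, content_type = page
    # Returning a response (rather than the request) tells CloudFront to
    # skip the origin and cache this according to Cache-Control.
    return {
        "status": "200",
        "statusDescription": "OK",
        "headers": {
            "content-type": [{"key": "Content-Type", "value": content_type}],
            "cache-control": [{"key": "Cache-Control", "value": "public, max-age=86400"}],
        },
        "body": body,
    }
```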

<h2 id="comparing-performance">Comparing Performance</h2>

<h3 id="between-solutions">Between Solutions</h3>
<p>While I only ended up implementing the full Lambda@Edge solution (and so don’t have concrete numbers for the others), we can make some deductions about relative performance:</p>

<ul>
  <li>vs. multi-region S3
    <ul>
      <li>Even with a bucket in <strong>every</strong> region, the Lambda function would still have to do an intra-region request/response to fetch the content from S3</li>
      <li>If there was only one bucket per continent, there would be additional inter-region latency</li>
    </ul>
  </li>
  <li>vs. large TTLs
    <ul>
      <li>Strictly better in the case of a complete cache miss (no round trip to the origin)</li>
      <li>Only worse on first page load in a region with no running Lambda</li>
    </ul>
  </li>
</ul>

<p>There’s also a few non-performance benefits to a pure Lambda@Edge solution (vs. large TTLs):</p>
<ul>
  <li>Don’t have to deal with invalidation on every site update</li>
  <li>Eliminates the single point of failure of the origin (just in case <a href="https://aws.amazon.com/message/41926/">https://aws.amazon.com/message/41926/</a> happens again)</li>
</ul>

<h3 id="old-vs-new">Old vs. New</h3>

<p>With all of that theory discussed, let’s look at some actual measurements (taken with <a href="https://pulse.turbobytes.com/">TurboBytes Pulse</a>):</p>

<p>The old method (GCS + CloudFlare) gave mean TTFBs of <strong>~350ms</strong> on effectively every page unless it happened to be in CloudFlare’s cache.</p>

<p>The new method gives global average response times of <strong>~210ms</strong> for the first connection in the region, and subsequent loads <em>of uncached pages</em> in <strong>~70ms</strong>.</p>

<p>TTFBs of pages in POP caches average <strong>~30ms</strong> on both.</p>

<h2 id="conclusion">Conclusion</h2>

<p>With a bit more effort it should be possible to keep a Lambda instance warm in each region, which should completely eliminate the first page TTFB penalty, giving consistent uncached TTFBs of ~70ms.
And with that, I’ve basically achieved my original goal (averaging just 10ms higher than I hoped for), significantly bringing down page load times and making the blog extremely snappy.</p>

<p>Will many people notice? No. But was it fun? Absolutely.</p>]]></content><author><name>Nick Gregory</name></author><category term="sysadmin" /><summary type="html"><![CDATA[Given the recent series of issues with Google Cloud, I decided it was time to jump ship and look at other providers for this blog (and eventually the rest of my sites most likely).]]></summary></entry></feed>