<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://guysrd.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://guysrd.github.io/" rel="alternate" type="text/html" /><updated>2026-05-25T17:00:32+00:00</updated><id>https://guysrd.github.io/feed.xml</id><subtitle></subtitle><entry><title type="html">The epoll uaf</title><link href="https://guysrd.github.io/epoll-uaf" rel="alternate" type="text/html" title="The epoll uaf" /><published>2026-05-10T00:00:00+00:00</published><updated>2026-05-10T00:00:00+00:00</updated><id>https://guysrd.github.io/epoll-uaf</id><content type="html" xml:base="https://guysrd.github.io/epoll-uaf"><![CDATA[<p>A couple of weeks ago Nicholas Carlini burned an epoll uaf race in <code class="language-plaintext highlighter-rouge">fs/eventpoll.c</code>. <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=07712db80857d5d09ae08f3df85a708ecfc3b61f">Commit 07712db80857</a> changed a <code class="language-plaintext highlighter-rouge">kfree()</code> to <code class="language-plaintext highlighter-rouge">kfree_rcu()</code>. The commit message says: “eventpoll: defer struct eventpoll free to RCU grace period.”</p>

<p>That one call fixed a uaf that had been reachable from any unprivileged process for a few years on any Linux / Android device running a 6.6 or later kernel with the affected optimization. 
This post is the story of the bug itself, what it gives you, and my (failed) attempts at exploiting it on a real modern device.
I spent a while on a Pixel 10 working on this bug, and in the process learned more about CFS vruntime tricks, SLUB internals, and the ARM64 memory model than I probably needed to.</p>

<hr />

<h2 id="epoll-in-2-seconds">epoll in 2 seconds</h2>

<p>If you’ve run a Linux server you’ve used epoll indirectly. It’s the kernel’s scalable I/O notification mechanism, the thing that lets nginx watch tens of thousands of sockets without blocking a thread per connection. 
Three syscalls: <code class="language-plaintext highlighter-rouge">epoll_create()</code> makes an instance, <code class="language-plaintext highlighter-rouge">epoll_ctl()</code> adds or removes watched file descriptors, <code class="language-plaintext highlighter-rouge">epoll_wait()</code> blocks until something happens.</p>

<p>Linux treats everything as a file, so an epoll fd is itself a file descriptor, and you can add an epoll to another epoll. This creates a directed graph of instances watching instances. The kernel has validation code inside <code class="language-plaintext highlighter-rouge">epoll_ctl(ADD)</code> that walks this graph to check for cycles and depth violations, and that validation code is where the bug lives.</p>
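<p>To make the nesting concrete, here is a minimal userspace sketch, assuming a Linux system with <code class="language-plaintext highlighter-rouge">epoll_create1()</code> available. It makes instance A watch instance B and then asks B to watch A. The second call is the kind of request the graph validation exists to reject, and <code class="language-plaintext highlighter-rouge">epoll_ctl(2)</code> documents <code class="language-plaintext highlighter-rouge">ELOOP</code> for it. The helper names <code class="language-plaintext highlighter-rouge">ep_watch</code> and <code class="language-plaintext highlighter-rouge">try_cycle</code> are made up for this illustration.</p>

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Add `child` to the interest list of `parent`.
 * Returns 0 on success or the errno value on failure. */
int ep_watch(int parent, int child)
{
    struct epoll_event ev = { .events = EPOLLIN };

    ev.data.fd = child;
    if (epoll_ctl(parent, EPOLL_CTL_ADD, child, &ev) == -1)
        return errno;
    return 0;
}

/* Make A watch B, then ask B to watch A. The second request would close
 * the loop A -> B -> A, so the kernel has to walk the graph and refuse it.
 * Returns the result of the second request, or -1 if setup failed. */
int try_cycle(void)
{
    int a = epoll_create1(0);
    int b = epoll_create1(0);
    int rc = -1;

    if (a >= 0 && b >= 0 && ep_watch(a, b) == 0)
        rc = ep_watch(b, a);

    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    return rc;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">try_cycle()</code> from a small test program should report <code class="language-plaintext highlighter-rouge">ELOOP</code>. Only nested adds like these reach the graph walkers discussed in the rest of the post.</p>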

<p>epoll has a <a href="https://lore.kernel.org/all/20240527185634.056918751@linuxfoundation.org/">history of</a> <a href="https://lore.kernel.org/all/20250714230744.3710270-3-sashal@kernel.org/">CVEs</a>; however, public documentation of their exploitation is very scarce.</p>

<hr />

<h2 id="structures">Structures</h2>

<p><img src="/1.svg" alt="epoll data structures and the UAF" /></p>

<p><strong><code class="language-plaintext highlighter-rouge">struct eventpoll</code></strong>: one per <code class="language-plaintext highlighter-rouge">epoll_create()</code>. Has the wait queue, the RB tree of items being watched, and <code class="language-plaintext highlighter-rouge">refs</code> at offset 176: an hlist head that links every <code class="language-plaintext highlighter-rouge">epitem</code> pointing back at this instance from somewhere else. It’s the incoming-edges list in the graph.</p>

<p><strong><code class="language-plaintext highlighter-rouge">struct epitem</code></strong>: one per (epoll instance, watched fd) pair. Has <code class="language-plaintext highlighter-rouge">epi-&gt;ep</code>, a pointer to its owning <code class="language-plaintext highlighter-rouge">eventpoll</code>. If the watched fd is itself an epoll, this epitem is also linked into <em>that</em> epoll’s <code class="language-plaintext highlighter-rouge">refs</code> hlist via <code class="language-plaintext highlighter-rouge">fllink</code>.</p>

<p>The graph walker iterates <code class="language-plaintext highlighter-rouge">ep-&gt;refs</code>, follows <code class="language-plaintext highlighter-rouge">epi-&gt;ep</code> for each entry to reach a parent <code class="language-plaintext highlighter-rouge">eventpoll</code>, and recurses. That <code class="language-plaintext highlighter-rouge">epi-&gt;ep</code> dereference is the UAF.</p>
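<p>The relationship between the two structures can be sketched as a toy userspace model. This is not the kernel’s definition: the <code class="language-plaintext highlighter-rouge">model_*</code> names are hypothetical, the hlist is reduced to a plain singly linked list, and the generation cache is left out. It only shows the shape of the traversal, with <code class="language-plaintext highlighter-rouge">refs</code> as the incoming-edge list and <code class="language-plaintext highlighter-rouge">epi-&gt;ep</code> as the hop to the parent.</p>

```c
#include <stddef.h>

struct model_eventpoll;

/* One (owner, watched fd) edge. `next` stands in for the fllink hlist node. */
struct model_epitem {
    struct model_eventpoll *ep;   /* owning (parent) instance */
    struct model_epitem *next;    /* next incoming edge of the watched instance */
};

struct model_eventpoll {
    struct model_epitem *refs;    /* incoming edges: who is watching me */
};

/* Record that `parent` watches `child`, using caller-provided edge storage. */
void model_watch(struct model_eventpoll *parent, struct model_eventpoll *child,
                 struct model_epitem *edge)
{
    edge->ep = parent;
    edge->next = child->refs;
    child->refs = edge;
}

/* Longest chain of parents above `ep`: iterate refs, hop through epi->ep. */
int model_upwards_depth(const struct model_eventpoll *ep)
{
    int result = 0;

    for (const struct model_epitem *epi = ep->refs; epi; epi = epi->next) {
        /* This hop is the dereference that goes stale in the real bug. */
        int d = model_upwards_depth(epi->ep) + 1;

        if (d > result)
            result = d;
    }
    return result;
}
```

<p>With B watching A and C watching B, <code class="language-plaintext highlighter-rouge">model_upwards_depth()</code> on A walks two hops up. In the model nothing is ever freed, so every hop is safe. The bug is about what happens when the parent behind one of those hops goes away mid-walk.</p>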

<hr />

<h2 id="the-2023-optimization">The 2023 Optimization</h2>

<p>Before March 2023, every <code class="language-plaintext highlighter-rouge">epoll_ctl(ADD)</code> with a nested target acquired a global mutex called <code class="language-plaintext highlighter-rouge">epmutex</code>. 
Under HTTP benchmarks, 58% of CPU time was lost to contention on it.</p>

<p>A patch replaced <code class="language-plaintext highlighter-rouge">epmutex</code> with a per-instance <code class="language-plaintext highlighter-rouge">refcount_t</code>, added a <code class="language-plaintext highlighter-rouge">dying</code> flag to <code class="language-plaintext highlighter-rouge">struct epitem</code>, and narrowed the remaining lock to only be held during actual graph walks. Throughput went up 60%.</p>

<p>The race happens in the graph walkers <code class="language-plaintext highlighter-rouge">ep_get_upwards_depth_proc</code> and <code class="language-plaintext highlighter-rouge">reverse_path_check_proc</code>. Both functions iterate <code class="language-plaintext highlighter-rouge">ep-&gt;refs</code> under <code class="language-plaintext highlighter-rouge">rcu_read_lock()</code> while other threads tear down the structures they’re pointing at. The old <code class="language-plaintext highlighter-rouge">epmutex</code> had been incidentally serializing this, but the new locking was too permissive and nobody noticed that the walkers could race. The reason is that the walkers don’t touch any of the data the mutex was nominally protecting; they were only reading the graph.</p>

<hr />

<h2 id="the-bug">The Bug</h2>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">ep_loop_check</span><span class="p">(</span><span class="k">struct</span> <span class="n">eventpoll</span> <span class="o">*</span><span class="n">ep</span><span class="p">,</span> <span class="k">struct</span> <span class="n">eventpoll</span> <span class="o">*</span><span class="n">to</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">int</span> <span class="n">depth</span><span class="p">,</span> <span class="n">upwards_depth</span><span class="p">;</span>

	<span class="n">inserting_into</span> <span class="o">=</span> <span class="n">ep</span><span class="p">;</span>
	<span class="cm">/*
	 * Check how deep down we can get from @to, and whether it is possible
	 * to loop up to @ep.
	 */</span>
	<span class="n">depth</span> <span class="o">=</span> <span class="n">ep_loop_check_proc</span><span class="p">(</span><span class="n">to</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">depth</span> <span class="o">&gt;</span> <span class="n">EP_MAX_NESTS</span><span class="p">)</span>
		<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
	<span class="cm">/* Check how far up we can go from @ep. */</span>
	<span class="n">rcu_read_lock</span><span class="p">();</span>
	<span class="n">upwards_depth</span> <span class="o">=</span> <span class="n">ep_get_upwards_depth_proc</span><span class="p">(</span><span class="n">ep</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
	<span class="n">rcu_read_unlock</span><span class="p">();</span>

	<span class="k">return</span> <span class="p">(</span><span class="n">depth</span><span class="o">+</span><span class="mi">1</span><span class="o">+</span><span class="n">upwards_depth</span> <span class="o">&gt;</span> <span class="n">EP_MAX_NESTS</span><span class="p">)</span> <span class="o">?</span> <span class="o">-</span><span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="p">..</span>
<span class="n">snip</span>
<span class="p">..</span>


<span class="k">static</span> <span class="kt">int</span> <span class="nf">ep_get_upwards_depth_proc</span><span class="p">(</span><span class="k">struct</span> <span class="n">eventpoll</span> <span class="o">*</span><span class="n">ep</span><span class="p">,</span> <span class="kt">int</span> <span class="n">depth</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">result</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">epitem</span> <span class="o">*</span><span class="n">epi</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">ep</span><span class="o">-&gt;</span><span class="n">gen</span> <span class="o">==</span> <span class="n">loop_check_gen</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">ep</span><span class="o">-&gt;</span><span class="n">loop_check_depth</span><span class="p">;</span>

    <span class="n">hlist_for_each_entry_rcu</span><span class="p">(</span><span class="n">epi</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ep</span><span class="o">-&gt;</span><span class="n">refs</span><span class="p">,</span> <span class="n">fllink</span><span class="p">)</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">ep_get_upwards_depth_proc</span><span class="p">(</span><span class="n">epi</span><span class="o">-&gt;</span><span class="n">ep</span><span class="p">,</span> <span class="n">depth</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
    <span class="n">ep</span><span class="o">-&gt;</span><span class="n">gen</span> <span class="o">=</span> <span class="n">loop_check_gen</span><span class="p">;</span>
    <span class="n">ep</span><span class="o">-&gt;</span><span class="n">loop_check_depth</span> <span class="o">=</span> <span class="n">result</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">ep_get_upwards_depth_proc</code> runs under <code class="language-plaintext highlighter-rouge">rcu_read_lock()</code>. Each <code class="language-plaintext highlighter-rouge">epitem</code> is safe when unlinked: it’s freed via <code class="language-plaintext highlighter-rouge">call_rcu()</code>, so RCU keeps it alive through the read-side critical section. There’s even a comment in the source that acknowledges the RCU reader:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* The rcu read side, reverse_path_check_proc(), does not make
 * use of the rbn field.
 */</span>
<span class="n">call_rcu</span><span class="p">(</span><span class="o">&amp;</span><span class="n">epi</span><span class="o">-&gt;</span><span class="n">rcu</span><span class="p">,</span> <span class="n">epi_rcu_free</span><span class="p">);</span>
</code></pre></div></div>

<p>That comment is correct about the <code class="language-plaintext highlighter-rouge">epitem</code>. It says nothing about what <code class="language-plaintext highlighter-rouge">epi-&gt;ep</code> points to.</p>

<p>Now look at the teardown path:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">ep_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">eventpoll</span> <span class="o">*</span><span class="n">ep</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">mutex_destroy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ep</span><span class="o">-&gt;</span><span class="n">mtx</span><span class="p">);</span>
    <span class="n">free_uid</span><span class="p">(</span><span class="n">ep</span><span class="o">-&gt;</span><span class="n">user</span><span class="p">);</span>
    <span class="n">wakeup_source_unregister</span><span class="p">(</span><span class="n">ep</span><span class="o">-&gt;</span><span class="n">ws</span><span class="p">);</span>
    <span class="n">kfree</span><span class="p">(</span><span class="n">ep</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">kfree()</code>. Immediate. No RCU grace period.</p>

<p>The walker loads <code class="language-plaintext highlighter-rouge">epi-&gt;ep</code> (a plain pointer read), then dereferences the target, but that <code class="language-plaintext highlighter-rouge">eventpoll</code> may have already been freed and reused by a completely different <code class="language-plaintext highlighter-rouge">kmalloc-256</code> allocation.</p>

<hr />

<h2 id="triggering-it">Triggering it</h2>

<p><img src="/2.svg" alt="The race timeline" /></p>

<p>I initially tried two threads on different CPUs, one walking the graph and one closing an epoll fd. It didn’t work: the window between loading <code class="language-plaintext highlighter-rouge">epi</code> from the hlist and following <code class="language-plaintext highlighter-rouge">epi-&gt;ep</code> is a handful of ARM64 instructions.
What does work is same-CPU preemption. The Frankel device I was testing on runs <code class="language-plaintext highlighter-rouge">CONFIG_PREEMPT=y</code> and <code class="language-plaintext highlighter-rouge">CONFIG_PREEMPT_RCU=y</code>, which means <code class="language-plaintext highlighter-rouge">rcu_read_lock()</code> just bumps a per-task counter; it doesn’t disable preemption. A timer tick during the walk can yield the CPU to the closer thread even though the walker is mid-RCU.</p>

<p>To give you a sense of the numbers (<code class="language-plaintext highlighter-rouge">CONFIG_HZ=250</code>, tick every 4 ms):</p>

<ul>
  <li>4,096 parents: walk takes ~400 us. Rarely overlaps a tick.</li>
  <li>8,000 parents: ~2 ms. Overlaps reliably. About 4% hit rate per attempt.</li>
</ul>
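<p>A rough way to read those numbers, as a back-of-the-envelope sketch rather than a measurement: if the walk starts at a uniformly random point in the tick period, the chance that at least one tick lands inside it is about the walk time divided by the tick period, capped at 1. The function names below are hypothetical, and the uniform-phase assumption is mine.</p>

```c
/* Tick period in microseconds for a given CONFIG_HZ value. */
double tick_period_us(int hz)
{
    return 1e6 / hz;
}

/* Estimated chance that at least one timer tick lands inside a walk of
 * `walk_us` microseconds, assuming the walk starts at a uniformly random
 * phase of the tick period. */
double tick_overlap_probability(double walk_us, int hz)
{
    double p = walk_us / tick_period_us(hz);

    return p > 1.0 ? 1.0 : p;
}
```

<p>Under that assumption a ~400 us walk at <code class="language-plaintext highlighter-rouge">HZ=250</code> overlaps a tick roughly one time in ten, and a ~2 ms walk about half the time. An overlapping tick is necessary but not sufficient, since it also has to land inside the few instructions that matter, which fits the observed per-attempt hit rate being much lower.</p>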

<p>If the closer thread busy-waits for the trigger signal, the scheduler treats it as the same priority as the walker and never switches. What works is having the closer call <code class="language-plaintext highlighter-rouge">usleep(1000)</code> in a loop while waiting: sleeping threads get scheduling priority when they wake, so the scheduler preempts the walker immediately.</p>

<p>The Pixel’s default governor throttles to 729 MHz at idle. At that frequency the traversal timing shifts enough that the race stops firing entirely :’)</p>

<hr />

<h2 id="what-gets-written">What Gets Written</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct eventpoll {
    struct mutex               mtx;                  /*     0    48 */
    wait_queue_head_t          wq;                   /*    48    24 */
    wait_queue_head_t          poll_wait;            /*    72    24 */
    struct list_head           rdllist;              /*    96    16 */
    rwlock_t                   lock;                 /*   112     8 */
    struct rb_root_cached      rbr;                  /*   120    16 */
    struct epitem *            ovflist;              /*   136     8 */
    struct wakeup_source *     ws;                   /*   144     8 */
    struct user_struct *       user;                 /*   152     8 */
    struct file *              file;                 /*   160     8 */
    u64                        gen;                  /*   168     8 */ /* read, then WRITE loop_check_gen */
    struct hlist_head          refs;                 /*   176     8 */ /* READ as hlist pointer           */
    u8                         loop_check_depth;     /*   184     1 */ /* WRITE 0 or a kernel pointer     */
    refcount_t                 refcount;             /*   188     4 */
    unsigned int               napi_id;              /*   192     4 */

    /* size: 200, cachelines: 4, members: 15 */
};
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">struct eventpoll</code> lives in <code class="language-plaintext highlighter-rouge">kmalloc-256</code> (order-1 slabs, 32 objects per slab, <code class="language-plaintext highlighter-rouge">cpu_partial=52</code> on this device). <code class="language-plaintext highlighter-rouge">init_on_free=1</code> is set by default on Frankel devices, and Android adds custom padding at the end of each object, so the structure differs slightly from mainline Linux.</p>
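<p>The offsets in the dump can be sanity-checked with a stand-in struct. This is a sketch under assumptions: the opaque byte arrays only reproduce the sizes printed above, the pointer fields are replaced by <code class="language-plaintext highlighter-rouge">uint64_t</code>, a 64-bit target and 4 KB pages are assumed, and <code class="language-plaintext highlighter-rouge">model_eventpoll</code> and <code class="language-plaintext highlighter-rouge">model_objects_per_slab</code> are hypothetical names, not kernel ones.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in with the same field sizes as the dump above (64-bit target). */
struct model_eventpoll {
    unsigned char mtx[48];        /*   0 */
    unsigned char wq[24];         /*  48 */
    unsigned char poll_wait[24];  /*  72 */
    unsigned char rdllist[16];    /*  96 */
    unsigned char lock[8];        /* 112 */
    unsigned char rbr[16];        /* 120 */
    uint64_t ovflist;             /* 136, a pointer in the kernel */
    uint64_t ws;                  /* 144 */
    uint64_t user;                /* 152 */
    uint64_t file;                /* 160 */
    uint64_t gen;                 /* 168 */
    uint64_t refs_first;          /* 176, hlist_head.first in the kernel */
    uint8_t  loop_check_depth;    /* 184 */
    uint32_t refcount;            /* 188, after 3 bytes of padding */
    uint32_t napi_id;             /* 192 */
};

/* Compile-time checks of the three offsets the walker touches. */
_Static_assert(offsetof(struct model_eventpoll, gen) == 168, "gen offset");
_Static_assert(offsetof(struct model_eventpoll, refs_first) == 176, "refs offset");
_Static_assert(offsetof(struct model_eventpoll, loop_check_depth) == 184, "depth offset");

/* Objects that fit in a slab of 2^order pages. */
size_t model_objects_per_slab(unsigned int order, size_t page_size, size_t obj_size)
{
    return (page_size << order) / obj_size;
}
```

<p>With 4 KB pages, <code class="language-plaintext highlighter-rouge">model_objects_per_slab(1, 4096, 256)</code> gives the 32 objects per slab quoted above, and the 200-byte struct rounds up into a 256-byte slot.</p>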

<p>Since the walker reads <code class="language-plaintext highlighter-rouge">refs.first</code> at offset 176, that is the offset to target. It was critical to my main attempt, which was to exploit this as a one-shot without any infoleak. What happens next depends on the value found there:</p>

<p>If it’s <strong>zero</strong> (the <code class="language-plaintext highlighter-rouge">init_on_free</code> case), the hlist looks empty. The walker skips the loop, writes <code class="language-plaintext highlighter-rouge">loop_check_gen</code> at 168 and a zero byte at 184, returns. Silent corruption of 9 bytes in whatever object gets reused.</p>

<p>If it’s <strong>nonzero</strong>, the walker follows it as a pointer to an <code class="language-plaintext highlighter-rouge">epitem</code>, computes <code class="language-plaintext highlighter-rouge">container_of()</code>, dereferences <code class="language-plaintext highlighter-rouge">epi-&gt;ep</code>, and recurses into wherever that points. This is what turns the bug into a write primitive.</p>

<p>If you can grab the object where you control offset 176, you steer the recursion. Each level writes <code class="language-plaintext highlighter-rouge">loop_check_gen</code> (a global u64 counter that increments per <code class="language-plaintext highlighter-rouge">epoll_ctl(ADD)</code>) and a zero byte at fixed offsets from the pointer target. That’s a constrained write primitive. What you do with it from there depends on what <code class="language-plaintext highlighter-rouge">kmalloc-256</code> object you use for reclaim, and how creative you’re feeling.</p>
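<p>The zero case described above can be modeled on a plain byte buffer. This is only a sketch: the 256-byte slot, the offsets taken from the struct dump, and the name <code class="language-plaintext highlighter-rouge">model_stale_walk</code> are illustrative, and the nonzero case (the recursion through a stale pointer) is deliberately not modeled.</p>

```c
#include <stdint.h>
#include <string.h>

enum { OFF_GEN = 168, OFF_REFS = 176, OFF_DEPTH = 184, SLOT_SIZE = 256 };

/* Model of the walker's effect on a reclaimed slot.
 * Returns  1 if it wrote gen and the depth byte,
 *          0 if the cached-generation check returned early,
 *         -1 if refs.first is nonzero (recursion, not modeled here). */
int model_stale_walk(unsigned char slot[SLOT_SIZE], uint64_t loop_check_gen)
{
    uint64_t gen, refs_first;

    memcpy(&gen, slot + OFF_GEN, sizeof gen);
    memcpy(&refs_first, slot + OFF_REFS, sizeof refs_first);

    if (gen == loop_check_gen)
        return 0;               /* looks already visited: nothing written */
    if (refs_first != 0)
        return -1;              /* would follow an attacker-influenced pointer */

    /* Empty hlist: the loop body never runs, result stays 0. */
    memcpy(slot + OFF_GEN, &loop_check_gen, sizeof loop_check_gen);
    slot[OFF_DEPTH] = 0;
    return 1;
}
```

<p>Feeding it a slot filled with a marker byte and a zeroed offset 176 changes exactly the eight bytes at 168 and the single byte at 184, which is the nine-byte footprint mentioned above.</p>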

<p>Note:
There are other paths I did not include in this blog. One of them leads to <code class="language-plaintext highlighter-rouge">mutex_unlock</code>, if you are careful and brave enough to walk into it. They require tremendous memory pressure, and some of them might be fruitful.
Trivia: We also control <code class="language-plaintext highlighter-rouge">gen</code> and <code class="language-plaintext highlighter-rouge">loop_check_depth</code>, which allows zeroing out (or writing, somewhat deterministically yet very slowly) a controlled value in the freed chunk.</p>

<hr />

<h2 id="can-you-cross-cache-this">Can You Cross Cache This?</h2>

<p>I wanted to exploit this vuln as a one-shot primitive using PTE corruption. My attempts failed, but this was my strategy. 
If I wanted an infoleak, I’d use a different primitive and then solve everything pretty easily with <code class="language-plaintext highlighter-rouge">refs.first</code> as a pointer.
Note: This part is technical. If you are not familiar with PCPs, Page Table Entries or SLUB / Buddy internals, I encourage you to read about them before you try reading this part.</p>

<p>The freed object goes into <code class="language-plaintext highlighter-rouge">kmalloc-256</code>, which uses order-1 slabs. ARM64 PTE pages are order-0 (4 KB). These sit on different PCP freelists. The order-1 page freed from the slab cache won’t satisfy an order-0 PTE request unless the PCP overflows and buddy splits it. Arranging that overflow during the narrow race window turned out to be non-trivial. It was possible to perform the split without invoking the race; however, integrating both pieces was never a success.</p>

<p>These pieces work separately. With shaping, 244 out of 250 slab pages go to buddy, and with 16 forked children faulting 8 GB each, all available UNMOVABLE order-1 pages get split for PTE allocations. The slab2buddy transition works and the buddy2PTE transition works; the problem is combining them with the race. The walker finishes in about 2 ms. The full cross-cache pipeline (SLUB discard, PCP drain, buddy insertion, PTE allocation with <code class="language-plaintext highlighter-rouge">__GFP_ZERO</code>) takes on the order of 100 ms. The gen write needs to land on a physical page that has <em>already</em> completed the transition from slab to PTE, and those timelines don’t overlap. I couldn’t find a way to stretch the walk long enough without resorting to <code class="language-plaintext highlighter-rouge">SCHED_FIFO</code> or similar privileged tricks, which defeats the purpose.</p>

<p>Same-cache reclaim sidesteps this entirely. SLUB’s per-CPU freelist is LIFO: last freed, first allocated. An immediate <code class="language-plaintext highlighter-rouge">kmalloc(256)</code> on the same CPU gets you the exact slot. The hard part is finding a <code class="language-plaintext highlighter-rouge">kmalloc-256</code> object with a useful layout at offsets 168 and 176. I did not invest too much time into this.</p>
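<p>The LIFO property is the whole reason same-cache reclaim is easy, so here is a toy model of it. This is a plain array-backed stack with made-up <code class="language-plaintext highlighter-rouge">model_*</code> names, not SLUB’s actual freelist (which threads pointers through the free objects themselves). It only illustrates that the most recently freed slot is the next one handed out.</p>

```c
#include <stddef.h>

#define MODEL_SLOTS 32

/* Toy per-CPU freelist: a stack of free object pointers. */
struct model_freelist {
    void *slot[MODEL_SLOTS];
    size_t count;
};

/* Push a freed object. Returns 0, or -1 if the toy list is full. */
int model_free(struct model_freelist *fl, void *obj)
{
    if (fl->count == MODEL_SLOTS)
        return -1;
    fl->slot[fl->count++] = obj;
    return 0;
}

/* Pop the most recently freed object, or NULL if none is available. */
void *model_alloc(struct model_freelist *fl)
{
    if (fl->count == 0)
        return NULL;
    fl->count--;
    return fl->slot[fl->count];
}
```

<p>Freeing two objects and then allocating twice returns them in reverse order. In this model, an allocation made right after the victim is freed gets the victim’s slot back.</p>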

<hr />

<h2 id="the-fix">The Fix</h2>

<p><a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=07712db80857d5d09ae08f3df85a708ecfc3b61f">Commit 07712db80857</a>:</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code> static void ep_free(struct eventpoll *ep)
 {
     mutex_destroy(&amp;ep-&gt;mtx);
     free_uid(ep-&gt;user);
     wakeup_source_unregister(ep-&gt;ws);
<span class="gd">-    kfree(ep);
</span><span class="gi">+    kfree_rcu(ep, rcu);
</span> }
</code></pre></div></div>

<p>The fix adds a <code class="language-plaintext highlighter-rouge">struct rcu_head</code> to <code class="language-plaintext highlighter-rouge">eventpoll</code>. <code class="language-plaintext highlighter-rouge">kfree_rcu()</code> defers the free until the RCU grace period ends. Since the walker holds <code class="language-plaintext highlighter-rouge">rcu_read_lock()</code>, the grace period can’t complete until it’s done.</p>
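<p>The difference between the two calls can be shown with a deliberately simplified model. Everything here is hypothetical: the <code class="language-plaintext highlighter-rouge">model_*</code> names, the single reader counter, and the rule that deferred frees run when the last reader leaves. Real RCU grace periods are more subtle, but the ordering guarantee being relied on is the same.</p>

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PENDING_MAX 8

struct model_obj { bool freed; };

struct model_rcu {
    int readers;                                   /* open read-side sections */
    struct model_obj *pending[MODEL_PENDING_MAX];  /* frees awaiting grace period */
    size_t npending;
};

void model_read_lock(struct model_rcu *r)
{
    r->readers++;
}

/* Leaving the last read-side section ends the toy grace period. */
void model_read_unlock(struct model_rcu *r)
{
    if (--r->readers == 0) {
        for (size_t i = 0; i < r->npending; i++)
            r->pending[i]->freed = true;
        r->npending = 0;
    }
}

/* Immediate free: takes effect even while a reader is active. */
void model_kfree(struct model_obj *o)
{
    o->freed = true;
}

/* Deferred free: queued while any reader is active. Returns -1 if full. */
int model_kfree_rcu(struct model_rcu *r, struct model_obj *o)
{
    if (r->readers == 0) {
        o->freed = true;
        return 0;
    }
    if (r->npending == MODEL_PENDING_MAX)
        return -1;
    r->pending[r->npending++] = o;
    return 0;
}

/* Walker enters its read section, the closer frees the object mid-walk,
 * then the walker dereferences it. Returns true if it touched freed memory. */
bool model_walker_hits_freed(bool use_rcu_free)
{
    struct model_rcu rcu = {0};
    struct model_obj ep = { .freed = false };
    bool hit;

    model_read_lock(&rcu);
    if (use_rcu_free)
        model_kfree_rcu(&rcu, &ep);
    else
        model_kfree(&ep);
    hit = ep.freed;             /* stands in for the epi->ep dereference */
    model_read_unlock(&rcu);
    return hit;
}
```

<p>In this model the immediate free is visible to the walker while it is still inside its read section, and the deferred free is not. That is the one-line change in the patch.</p>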

<hr />

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>What stays with me about this bug isn’t the race condition or the allocator internals. It’s how much work it takes to understand which code paths in epoll are protected by what. Wait queue locks serialize callbacks. File refcounts gate <code class="language-plaintext highlighter-rouge">ep_free</code>. <code class="language-plaintext highlighter-rouge">__fput</code> sequences cleanup. <code class="language-plaintext highlighter-rouge">call_rcu</code> defers <code class="language-plaintext highlighter-rouge">epitem</code> frees. Each mechanism covers something. You have to hold all of them in your head at once before you can point at <code class="language-plaintext highlighter-rouge">epi-&gt;ep</code> and be sure that nothing is keeping the target alive. I spent several days just on that part.</p>

<p>I encourage anyone to try to exploit this on a modern Android system. It sounds fun, and I’d be interested to see how you managed to
get stable arbitrary read and write primitives based on this bug.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A couple of weeks ago Nicholas Carlini burned an epoll uaf race in fs/eventpoll.c. Commit 07712db80857 changed a kfree() to kfree_rcu(). The commit message says: “eventpoll: defer struct eventpoll free to RCU grace period.”]]></summary></entry></feed>