futex: remove_waiter stack uaf

This post describes a stack UAF in the Linux futex subsystem that sat in the code since 2011 and was patched in April 2026. With the right magic, it could allow any adversary with an untrusted SELinux context to elevate privileges. I did not reach code execution with it; I only managed to trigger it, and I share my thoughts on the journey here. I hope you like it.

LLM Technical summary

A stack use after free in `remove_waiter()` in `kernel/locking/rtmutex.c`. The function clears `current->pi_blocked_on` but on the proxy lock path (via `FUTEX_CMP_REQUEUE_PI`), `current` is the requeuer, not the waiter. The waiter's `pi_blocked_on` is left dangling into a popped stack frame. Any subsequent PI chain walk through the waiter dereferences the stale pointer. Introduced in v2.6.38 (commit `8161239a8bcc`, January 2011). Fixed in commit `3bfdc63936dd` (April 2026), authored by Keenan Dong, committed by Thomas Gleixner. Backported to stable 6.1.175, 6.6.140, 6.12.86, 6.18.27. Not backported to 5.15 or 5.10. Android GKI android14 6.1 and android15 6.6 patched June 8, 2026. Android13 5.10 and 5.15 remain unpatched. Trigger: three threads, two PI futexes, force a deadlock cycle via requeue so `task_blocks_on_rt_mutex` returns `-EDEADLK` and `remove_waiter` runs in the requeuer's context. The dangling `pi_blocked_on->lock` (offset 88 in `struct rt_mutex_waiter`, 112 bytes total) is read by `task_blocked_on_lock` during any future chain walk. Stack spray controls the data.

Following my last post on requeue_pi_wake_futex, I kept staring at the same code. If you need more information about futex, I encourage you to read my previous post, Elon’s towelroot posts and futex internals. PI futex requeue is one of those weird pieces of kernel wizardry where a single helper has two completely different kinds of callers: the task itself on the slowlock path, and somebody else acting on the task’s behalf on the proxy path. Any time you see the current variable referenced inside a helper that’s reachable from both, you’re looking at a candidate bug.

So I tried my luck :’) Once again, I ask you to read up on the futex subsystem internals before diving into this post. I found one candidate in the cleanup path of rt_mutex_start_proxy_lock that fits the pattern exactly. It was fixed upstream in commit 3bfdc63936dd “rtmutex: Use waiter::task instead of current in remove_waiter()”, authored by Keenan Dong and committed by Thomas Gleixner.

The bug is a stack UAF: task_struct->pi_blocked_on gets left pointing at a struct rt_mutex_waiter that lives on the waiter task’s kernel stack. When the waiter returns from its syscall the stack frame is popped, but pi_blocked_on keeps pointing at the slot, and the slot’s bytes are immediately reusable by the task’s next syscall. Any future PI chain walk through the task dereferences the dangling pointer. I fiddled a bit with this bug.

rt_mutex and PI futexes in 30 seconds

A PI futex is a futex with an rt_mutex stapled to it. The userspace word holds the owner TID; the kernel-side struct futex_pi_state wraps an rt_mutex_base and keeps it in sync with the user word.
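To make “the userspace word holds the owner TID” concrete, here is a minimal sketch of how a PI futex word can be decoded. The bit values mirror include/uapi/linux/futex.h (FUTEX_WAITERS, FUTEX_OWNER_DIED, FUTEX_TID_MASK); the DEMO_ names and the helper functions are made up for this example so it builds without kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same values as the uapi futex.h constants, renamed for this sketch. */
#define DEMO_FUTEX_WAITERS    0x80000000u  /* someone is blocked in the kernel */
#define DEMO_FUTEX_OWNER_DIED 0x40000000u  /* owner exited while holding it */
#define DEMO_FUTEX_TID_MASK   0x3fffffffu  /* low 30 bits: owner TID */

/* Hypothetical helpers: split a PI futex word into its parts. */
static inline uint32_t pi_word_owner_tid(uint32_t word)
{
    return word & DEMO_FUTEX_TID_MASK;
}

static inline bool pi_word_has_waiters(uint32_t word)
{
    return (word & DEMO_FUTEX_WAITERS) != 0;
}

static inline bool pi_word_owner_died(uint32_t word)
{
    return (word & DEMO_FUTEX_OWNER_DIED) != 0;
}
```

A word of 0 means unlocked; a word equal to a bare TID means locked and uncontended, which is the case userspace can handle without entering the kernel.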

rt_mutex is the kernel’s priority inheritance mutex implementation. When a high priority waiter blocks on a lock held by a lower priority owner, the owner gets boosted to the waiter’s priority until it releases. The boost walks the chain: if the owner is itself waiting on another lock, that lock’s owner gets boosted too, all the way until we hit a runnable task. Honestly, the code in rt_mutex_adjust_prio_chain is very complex and I had a very hard time reading it; to this day I don’t understand it.
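The boost propagation described above can be sketched as a toy model. This is not the kernel algorithm (no rbtrees, no top-waiter bookkeeping, no locking); the struct and function names are invented for the example, and it assumes the kernel convention that a lower prio value means a higher priority.

```c
#include <stddef.h>

struct boost_lock;

/* Toy task: lower prio value = higher priority (kernel convention). */
struct boost_task {
    int prio;
    struct boost_lock *blocked_on;  /* lock this task waits for, or NULL */
};

struct boost_lock {
    struct boost_task *owner;       /* current holder, or NULL */
};

/*
 * Propagate waiter's priority along owner -> blocked_on -> owner ...
 * Returns how many owners were boosted. max_depth bounds the walk.
 */
static int model_boost_chain(const struct boost_task *waiter,
                             struct boost_lock *lock, int max_depth)
{
    int boosted = 0;

    for (int depth = 0; depth < max_depth && lock; depth++) {
        struct boost_task *owner = lock->owner;

        if (!owner || owner->prio <= waiter->prio)
            break;                  /* nobody to boost, or already high enough */
        owner->prio = waiter->prio; /* inherit the waiter's priority */
        boosted++;
        lock = owner->blocked_on;   /* NULL means the owner is runnable */
    }
    return boosted;
}
```

With A (prio 10) blocking on a lock owned by B (prio 50), which is itself blocked on a lock owned by C (prio 80), both B and C end up at prio 10.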

This is where the proxy pattern enters. FUTEX_CMP_REQUEUE_PI is the op that was built for pthread_cond_broadcast and pthread_cond_signal on condvars tied to a PI mutex; older glibc releases used it under the hood, newer ones may not. Their POSIX declarations:

#include <pthread.h>

int pthread_cond_broadcast(pthread_cond_t *cond);
int pthread_cond_signal(pthread_cond_t *cond);

pthread_cond_broadcast unblocks all threads waiting on the condvar, pthread_cond_signal unblocks at least one. Under the hood glibc’s NPTL (nptl/pthread_cond_signal.c) maps these onto the futex syscall:

/* include/uapi/linux/futex.h */
#define FUTEX_WAIT_REQUEUE_PI   11
#define FUTEX_CMP_REQUEUE_PI    12

FUTEX_WAIT_REQUEUE_PI is the waiter side — sleep on the condvar futex, expecting to be requeued onto a PI mutex later. FUTEX_CMP_REQUEUE_PI is the broadcaster side — atomically move waiters from the condvar to the PI mutex.

A thread that called pthread_cond_wait is asleep on the condvar’s futex. When somebody calls broadcast, the kernel has to move that waiter onto the condvar’s associated PI mutex’s wait queue without waking it first. The requeuer does that work, in its own context, on the waiter’s behalf:

sequenceDiagram
    participant W as Waiter
    participant R as Requeuer
    participant K as Kernel

    W->>K: pthread_cond_wait → FUTEX_WAIT_REQUEUE_PI(condvar, mutex)
    Note over W: parked on condvar futex, asleep
    R->>K: pthread_cond_broadcast → FUTEX_CMP_REQUEUE_PI(condvar, mutex)
    K->>K: rt_mutex_start_proxy_lock(mutex, waiter):<br/>enqueue Waiter on mutex->waiters<br/>(Requeuer's context, Waiter's behalf)
    K-->>R: return
    Note over W: still asleep,<br/>now a waiter on mutex

The kernel helper for “do an rt_mutex enqueue on another task’s behalf” is rt_mutex_start_proxy_lock:

int rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
                              struct rt_mutex_waiter *waiter,
                              struct task_struct *task)
{
    int ret;
    raw_spin_lock_irq(&lock->wait_lock);
    ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
    if (unlikely(ret))
        remove_waiter(lock, waiter);
    raw_spin_unlock_irq(&lock->wait_lock);
    return ret;
}

task is the waiter, passed explicitly. current is whoever is running the requeue in the futex path, i.e. the requeuer. The whole story is what happens when task != current and the cleanup code forgets which one it’s supposed to be touching.

Triggering

This is confusing and techie, so bear with me. I hope the drawing can help you. In order to trigger this bug we spawn (at least) three threads with two PI futexes:

Holder owns the target rt_mutex (the one behind uaddr2).
Waiter owns a second PI futex called other.
Holder is blocked on other so the kernel’s PI graph already knows “Holder wants other, owned by Waiter”.
Waiter is parked in FUTEX_WAIT_REQUEUE_PI(uaddr1, ..., uaddr2), sleeping until somebody moves it onto target.
Requeuer calls FUTEX_CMP_REQUEUE_PI(uaddr1, ..., uaddr2) to do the move.

When the kernel enqueues Waiter on target and walks the PI chain Waiter → target → Holder → other → Waiter it spots the cycle, returns -EDEADLK and that’s the path that calls remove_waiter() with the wrong current.
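The cycle Waiter → target → Holder → other → Waiter can be reproduced with a toy graph walk. This is a sketch under simplifying assumptions: one lock per blocked task, no priorities, and invented names (cyc_task, cyc_lock, model_chain_walk). Like the kernel’s walk, this model also reports -EDEADLK when it runs out of depth.

```c
#include <errno.h>
#include <stddef.h>

struct cyc_lock;

struct cyc_task {
    struct cyc_lock *blocked_on;  /* lock this task waits for, or NULL */
};

struct cyc_lock {
    struct cyc_task *owner;       /* task holding the lock, or NULL */
};

/*
 * `task` is about to block on `lock`. Follow owner -> blocked_on -> owner
 * and return -EDEADLK if the walk arrives back at `task`, 0 if it ends at
 * an unowned lock or a runnable owner.
 */
static int model_chain_walk(const struct cyc_task *task,
                            const struct cyc_lock *lock, int max_depth)
{
    for (int depth = 0; depth < max_depth; depth++) {
        const struct cyc_task *owner = lock->owner;

        if (!owner)
            return 0;             /* lock is free */
        if (owner == task)
            return -EDEADLK;      /* we would wait on ourselves */
        lock = owner->blocked_on;
        if (!lock)
            return 0;             /* owner is runnable, chain ends */
    }
    return -EDEADLK;              /* too deep: treated as a deadlock here */
}
```

Setting up Holder as owner of target and blocked on other, with Waiter owning other, makes the walk for Waiter on target return -EDEADLK, which is the return value that sends the real code into the cleanup branch.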

sequenceDiagram
    participant W as Waiter (CPU 0)
    participant R as Requeuer (CPU 1)

    Note over W: parked in futex_wait_requeue_pi<br/>rt_waiter lives on Waiter's kstack
    R->>R: futex_requeue / proxy_lock for Waiter
    R->>W: task_blocks_on_rt_mutex:<br/>Waiter.pi_blocked_on = &rt_waiter
    R->>R: chain walk → cycle → -EDEADLK
    R->>R: remove_waiter():<br/>clears current.pi_blocked_on<br/>current == Requeuer (wrong task)
    Note over W: Waiter.pi_blocked_on STILL = &rt_waiter
    W->>W: wakes, takes IGNORE path,<br/>returns from syscall, kstack pops
    Note over W: Waiter.pi_blocked_on dangles<br/>into freed stack frame

A few moments later, any other thread (Probe) that blocks on a lock Waiter owns will trigger the chain walker to call task_blocked_on_lock(Waiter), which dereferences Waiter.pi_blocked_on->lock. The shtick is that the address itself is still valid, because Waiter’s stack stays allocated for as long as Waiter is alive, but the data is no longer an rt_mutex_waiter :’) That frame got popped when Waiter returned from futex_wait_requeue_pi, and the same stack region is now scratch space for whatever syscall Waiter has run since. This is crucial to understand and remember. Here’s how these structures actually look (pahole, v6.6.138, x86_64):

struct rt_mutex_waiter {
    struct rt_waiter_node  tree;           /*     0    40 */
    struct rt_waiter_node  pi_tree;        /*    40    40 */
    struct task_struct *   task;           /*    80     8 */
    struct rt_mutex_base * lock;           /*    88     8 */  /* ← the UAF read */
    unsigned int           wake_state;     /*    96     4 */
    /* 4 bytes hole */
    struct ww_acquire_ctx * ww_ctx;        /*   104     8 */

    /* size: 112, cachelines: 2 */
};

struct rt_mutex_base {
    raw_spinlock_t         wait_lock;      /*     0     4 */
    /* 4 bytes hole */
    struct rb_root_cached  waiters;        /*     8    16 */
    struct task_struct *   owner;          /*    24     8 */

    /* size: 32 */
};

The chain walker dereferences these bytes as if they were still a struct rt_mutex_waiter and takes the lock field at offset 88, and that becomes the next_lock it follows :’) Choose the Waiter’s next syscall so that its kernel frame plants attacker bytes at the rt_waiter offset, and you control what the chain walker reads. You need an infoleak to win here, or you need to be smarter and turn this bug into an infoleak primitive.
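If you want to sanity-check the offsets without running pahole, a userspace mirror of the layout is enough. The mirror_ structs below are my own approximation of the v6.6 definitions (an rb_node of three words, a waiter node with prio and deadline) and assume an LP64 target such as x86_64; they are not the kernel headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Approximate userspace mirrors, LP64 assumed. Not the kernel definitions. */
struct mirror_rb_node {
    unsigned long parent_color;
    void *right;
    void *left;
};                                   /* 24 bytes */

struct mirror_rt_waiter_node {
    struct mirror_rb_node entry;
    int prio;
    uint64_t deadline;
};                                   /* 40 bytes with padding */

struct mirror_rt_mutex_waiter {
    struct mirror_rt_waiter_node tree;
    struct mirror_rt_waiter_node pi_tree;
    void *task;
    void *lock;                      /* the field read through the stale pointer */
    unsigned int wake_state;
    void *ww_ctx;
};

/* Fail the build if the mirror disagrees with the pahole dump. */
_Static_assert(offsetof(struct mirror_rt_mutex_waiter, lock) == 88,
               "lock expected at offset 88");
_Static_assert(sizeof(struct mirror_rt_mutex_waiter) == 112,
               "waiter expected to be 112 bytes");

static size_t mirror_lock_offset(void)
{
    return offsetof(struct mirror_rt_mutex_waiter, lock);
}
```

The static asserts make a layout mismatch a compile error, which is handy when you move between kernel versions or configs that change the struct.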

int __rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
                                struct rt_mutex_waiter *waiter,
                                struct task_struct *task)
{
    int ret;
    lockdep_assert_held(&lock->wait_lock);

    if (try_to_take_rt_mutex(lock, task, NULL))
        return 1;

    ret = task_blocks_on_rt_mutex(lock, waiter, task, NULL,
                                  RT_MUTEX_FULL_CHAINWALK);

    if (ret && !rt_mutex_owner(lock)) {
        /* the owner went away while we were chain walking — call it success */
        ret = 0;
    }

    return ret;
}

The interesting call is task_blocks_on_rt_mutex. When FULL_CHAINWALK is set, several things happen:

Set task->pi_blocked_on = waiter and waiter->task = task.
Enqueue the waiter into lock->waiters (rbtree).
Walk the PI chain, if a cycle is detected return -EDEADLK.

Step 1 is where the dangling pointer originates. Step 3 is where it goes wrong.

When the chain walk returns -EDEADLK, the wrapper takes the cleanup branch (v6.6.138, kernel/locking/rtmutex_api.c:339):

int __sched rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
				      struct rt_mutex_waiter *waiter,
				      struct task_struct *task)
{
	int ret;

	raw_spin_lock_irq(&lock->wait_lock);
	ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
	if (unlikely(ret))
		remove_waiter(lock, waiter);
	raw_spin_unlock_irq(&lock->wait_lock);

	return ret;
}

and calls remove_waiter (v6.6.138, kernel/locking/rtmutex.c:1515):

static void __sched remove_waiter(struct rt_mutex_base *lock,
				  struct rt_mutex_waiter *waiter)
{
	bool is_top_waiter = (waiter == rt_mutex_top_waiter(lock));
	struct task_struct *owner = rt_mutex_owner(lock);
	struct rt_mutex_base *next_lock;

	lockdep_assert_held(&lock->wait_lock);

	raw_spin_lock(&current->pi_lock);
	rt_mutex_dequeue(lock, waiter);
	current->pi_blocked_on = NULL;
	raw_spin_unlock(&current->pi_lock);

	/*
	 * Only update priority if the waiter was the highest priority
	 * waiter of the lock and there is an owner to update.
	 */
	if (!owner || !is_top_waiter)
		return;

	raw_spin_lock(&owner->pi_lock);

	rt_mutex_dequeue_pi(owner, waiter);

	if (rt_mutex_has_waiters(lock))
		rt_mutex_enqueue_pi(owner, rt_mutex_top_waiter(lock));

	rt_mutex_adjust_prio(lock, owner);

	/* Store the lock on which owner is blocked or NULL */
	next_lock = task_blocked_on_lock(owner);

	raw_spin_unlock(&owner->pi_lock);

	/*
	 * Don't walk the chain, if the owner task is not blocked
	 * itself.
	 */
	if (!next_lock)
		return;

	/* gets dropped in rt_mutex_adjust_prio_chain()! */
	get_task_struct(owner);

	raw_spin_unlock_irq(&lock->wait_lock);

	rt_mutex_adjust_prio_chain(owner, RT_MUTEX_MIN_CHAINWALK, lock,
				   next_lock, NULL, current);

	raw_spin_lock_irq(&lock->wait_lock);
}

current here is the requeuer. We’re locking the requeuer’s pi_lock and clearing the requeuer’s pi_blocked_on. The waiter’s pi_blocked_on keeps pointing at the on-stack rt_waiter that the wrapper just dequeued.
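Stripped of locking and rbtrees, the confusion fits in a few lines. This is a toy model with invented names (rm_task, rm_waiter, and the two model_ functions), where current is passed in explicitly so both behaviours can be compared side by side.

```c
#include <stddef.h>

struct rm_waiter;

struct rm_task {
    struct rm_waiter *pi_blocked_on;
};

struct rm_waiter {
    struct rm_task *task;   /* the task this waiter belongs to */
};

/* Pre-fix behaviour: clears the caller's pointer, whoever the caller is. */
static void model_remove_waiter_buggy(struct rm_task *current_task,
                                      struct rm_waiter *waiter)
{
    (void)waiter;
    current_task->pi_blocked_on = NULL;
}

/* Post-fix behaviour: clears the pointer of the task the waiter belongs to. */
static void model_remove_waiter_fixed(struct rm_task *current_task,
                                      struct rm_waiter *waiter)
{
    (void)current_task;
    waiter->task->pi_blocked_on = NULL;
}
```

On the slowlock path current_task and waiter->task are the same task, so both variants behave identically. Only when the requeuer is passed as current_task does the buggy variant leave the waiter’s pi_blocked_on set.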

When does this matter? When the waiter task returns from its syscall, its kernel stack rewinds. The rt_waiter is gone, and task->pi_blocked_on is now pointing into undefined memory :’)

The fix is simple, but requires understanding this whole clusterfuck. The patch replaces current with waiter->task:

struct task_struct *waiter_task = waiter->task;
...
raw_spin_lock(&waiter_task->pi_lock);
rt_mutex_dequeue(lock, waiter);
waiter_task->pi_blocked_on = NULL;
raw_spin_unlock(&waiter_task->pi_lock);

Gaining primitives

pi_blocked_on is read by anything that walks the PI chain through Waiter.

The most direct consumer is task_blocked_on_lock:

static struct rt_mutex_base *task_blocked_on_lock(struct task_struct *p)
{
    return p->pi_blocked_on ? p->pi_blocked_on->lock : NULL;
}

Called from task_blocks_on_rt_mutex and rt_mutex_adjust_prio_chain. Every time some new task (Probe) blocks on an rt_mutex whose owner is Waiter, the chain walk reads Waiter->pi_blocked_on->lock to figure out where to walk next. That’s the use-after-free read.

sequenceDiagram
    participant P as Probe (new blocker)
    participant K as Kernel (chain walk)
    participant W as Waiter (stale pi_blocked_on)

    P->>K: FUTEX_LOCK_PI on a lock Waiter owns
    K->>K: task_blocks_on_rt_mutex → walk the PI chain
    K->>W: owner is Waiter → task_blocked_on_lock(Waiter)
    Note over W: read Waiter->pi_blocked_on->lock<br/>pi_blocked_on dangles into Waiter's freed stack slot
    W-->>K: returns attacker-controlled next_lock
    Note over K: chain walk follows the fake rt_mutex

We can spray and grab this object. When Waiter returns, the stack rewinds without zeroing any variable. Waiter’s next syscall starts from the same stack top and depending on the syscall, the scratch space in that new frame overlaps the old rt_waiter slot with bytes the user supplies.
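The stack reuse can be modelled safely in userspace with a byte array standing in for the kernel stack. Everything here is hypothetical scaffolding: the buffer size, the slot offset of 512 and the function names are arbitrary; only the 88-byte offset of the lock field comes from the layout above.

```c
#include <stdint.h>
#include <string.h>

#define FAKE_STACK_SIZE 1024
#define WAITER_SLOT_OFF 512  /* arbitrary position of rt_waiter in the model */
#define LOCK_FIELD_OFF  88   /* offset of `lock` in struct rt_mutex_waiter */

static unsigned char fake_kstack[FAKE_STACK_SIZE];

/* "First syscall": leaves a waiter whose lock field holds real_lock. */
static void first_syscall_frame(uint64_t real_lock)
{
    memcpy(&fake_kstack[WAITER_SLOT_OFF + LOCK_FIELD_OFF],
           &real_lock, sizeof(real_lock));
}

/* "Next syscall": a frame at frame_off filled with caller-supplied bytes. */
static void next_syscall_frame(const void *data, size_t len, size_t frame_off)
{
    if (frame_off >= FAKE_STACK_SIZE)
        return;
    if (len > FAKE_STACK_SIZE - frame_off)
        len = FAKE_STACK_SIZE - frame_off;  /* clamp to the model stack */
    memcpy(&fake_kstack[frame_off], data, len);
}

/* What a chain walk reads through the stale pi_blocked_on pointer. */
static uint64_t stale_read_lock(void)
{
    uint64_t v;

    memcpy(&v, &fake_kstack[WAITER_SLOT_OFF + LOCK_FIELD_OFF], sizeof(v));
    return v;
}
```

Nothing clears the slot between the two calls, so whichever frame wrote last decides what the stale pointer yields. That is the whole primitive: find a syscall whose frame overlaps the slot with bytes you choose.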

On a 6.6.138 build (CONFIG_INIT_STACK_ALL_ZERO=y, CONFIG_VMAP_STACK=n), the lock field of the on-stack rt_waiter lands at a fixed offset: THREAD_TOP - 0x208. The layout is deterministic per build. I encourage you to shape your layout and land on this field, rest is up to you to continue :’)

I tested this on QEMU and on a Frankel device. Here’s the panic I received on QEMU:

BUG: KASAN: wild-memory-access in do_raw_spin_trylock+0x69/0x120
Read of size 4 at addr 1ffff1100167efa3 by task poc/61
CPU: 0 PID: 61 Comm: poc Not tainted 6.6.138 #4
Call Trace:
 <TASK>
 kasan_report+0xd8/0x110
 kasan_check_range+0x105/0x1b0
 do_raw_spin_trylock+0x69/0x120
 ? task_blocks_on_rt_mutex.constprop.0.isra.0+0x29d/0xb10
 _raw_spin_trylock+0x19/0x70
 rt_mutex_adjust_prio_chain.isra.0+0x120/0x1640
 task_blocks_on_rt_mutex.constprop.0.isra.0+0x390/0xb10
 __rt_mutex_start_proxy_lock+0x61/0xa0
 futex_lock_pi+0x31f/0x5a0
 do_futex+0xa6/0x230
 __x64_sys_futex+0x1b8/0x2b0
 do_syscall_64+0x39/0x90
 entry_SYSCALL_64_after_hwframe+0x78/0xe2

The call chain is exactly the chain walker reaching Waiter->pi_blocked_on->lock->wait_lock. KASAN flags it as a wild memory access because the freed-stack address has no shadow.

Same call chain on the non-KASAN build oopses at _raw_spin_trylock with the planted sentinel in the registers:

general protection fault, probably for non-canonical address 0x4141414141414141: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:_raw_spin_trylock+0x10/0x50
RAX: 0000000000000078 RBX: ffff888043211040 RCX: 4141414141414141
RDX: 0000000000000001 RSI: 0000000000000400 RDI: 4141414141414141
R15: 4141414141414141
 rt_mutex_adjust_prio_chain+0x9a/0x8f0
 task_blocks_on_rt_mutex.constprop.0+0x1c4/0x3c0
 __rt_mutex_start_proxy_lock+0x4d/0x70
 futex_lock_pi+0x25d/0x480

Unusual primitive ideas

There are a few rt_mutex_base exploitation techniques on the web; these are my thoughts. rt_mutex_base looks as follows:

struct rt_mutex_base {
    raw_spinlock_t  wait_lock;   /* offset 0 */
    struct rb_root_cached waiters; /* offset 8..23 */
    struct task_struct *owner;   /* offset 24 */
};

If the chunk’s first 4 bytes are zero (looks like an unlocked spinlock), the trylock succeeds and the walk proceeds: it reads waiters (rb tree) and reads owner (next task to walk into). If they’re non-zero, trylock fails and the walk exits cleanly. If the address is unmapped, Probe (the task triggering the chain walk) takes a page fault, oopses, you die.
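The first two outcomes can be written down as a small decision function over the 32 bytes the walker would interpret as an rt_mutex_base. This is a simplified sketch with invented names; it assumes a non-debug spinlock where zero means unlocked, and it masks bit 0 of owner because the kernel keeps its “has waiters” flag there. The unmapped-address case obviously cannot be modelled in userspace.

```c
#include <stdint.h>
#include <string.h>

enum walk_outcome { WALK_STOPS, WALK_CONTINUES };

struct fake_lock_view {
    enum walk_outcome outcome;
    uint64_t owner;               /* next task pointer, flag bit stripped */
};

/* bytes must point at least 32 readable bytes (sizeof rt_mutex_base). */
static struct fake_lock_view model_walk_step(const unsigned char *bytes)
{
    struct fake_lock_view view = { WALK_STOPS, 0 };
    uint32_t wait_lock;

    memcpy(&wait_lock, bytes, sizeof(wait_lock));      /* offset 0 */
    if (wait_lock != 0)
        return view;              /* trylock fails, walk exits cleanly */

    memcpy(&view.owner, bytes + 24, sizeof(view.owner)); /* offset 24 */
    view.owner &= ~(uint64_t)1;   /* drop the has-waiters flag bit */
    view.outcome = WALK_CONTINUES;
    return view;
}
```

So the bytes you plant decide two things at once: whether the walk keeps going, and which task_struct it walks into next.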

The fix

 static void __sched remove_waiter(struct rt_mutex_base *lock,
                                   struct rt_mutex_waiter *waiter)
 {
+    struct task_struct *waiter_task = waiter->task;
     bool is_top_waiter = (waiter == rt_mutex_top_waiter(lock));
     struct task_struct *owner = rt_mutex_owner(lock);
     struct rt_mutex_base *next_lock;

     lockdep_assert_held(&lock->wait_lock);

-    raw_spin_lock(&current->pi_lock);
+    raw_spin_lock(&waiter_task->pi_lock);
     rt_mutex_dequeue(lock, waiter);
-    current->pi_blocked_on = NULL;
-    raw_spin_unlock(&current->pi_lock);
+    waiter_task->pi_blocked_on = NULL;
+    raw_spin_unlock(&waiter_task->pi_lock);
     ...
 }

Conclusions

This is a very rare and interesting bug imo. A helper gets written first for the obvious caller, where “the task we’re operating on” and current happen to coincide. Later someone adds a proxy callsite (a function that does the operation on behalf of someone else) and the helper keeps reaching for current because nobody flagged it. Lockdep is happy: it cares about the spinlock, not whose pi_lock it actually is. The function still does something when called, and on the slowlock path that something is correct. It’s only on the proxy path that the wrong pi_lock and the wrong pi_blocked_on quietly get written.

The previous bug in this same neighborhood, the missing READ_ONCE(q->task) in requeue_pi_wake_futex, was the same family. There “the task that owns the q” was conflated with “the task currently reading q”. Here it’s “the task we’re cleaning up after” conflated with “current”. Both are load-bearing assumptions, both invisible until you ask the question explicitly.

I find this bug pretty unusual and novel, and kudos to whoever manages to exploit this reliably :’) Identifying the confused current in such a complex environment and triggering this path to grab the object and reach an arbitrary read and write is novel to me.

If anyone takes it further I’d love to hear your opinion about this bug. I hope you enjoyed this post.

The futex READ_ONCE

A futex is a 32-bit integer in userspace memory. Uncontended operations are pure userspace atomic ops; the kernel is only involved when someone needs to sleep or wake up. PI futexes add priority inheritance: the futex word stores the owner’s TID, and the kernel boosts the holder’s priority when a higher priority thread is waiting.

If you’ve ever done pwnable.kr and worked on towelroot, you probably remember FUTEX_CMP_REQUEUE_PI. This is how pthread_cond_signal works under the hood. The FUTEX_CMP_REQUEUE_PI syscall takes two userspace addresses: uaddr1 is the condition variable’s futex and uaddr2 is the PI mutex’s futex. A waiter sleeps on uaddr1 and when signaled the kernel moves it to the wait queue behind uaddr2. If the mutex is uncontested, the kernel can acquire it atomically on behalf of the waiter and skip the requeue entirely. That fast path “lock acquired atomically, just wake the waiter” is where the bug is.

It is very rare to see any bugs in this subsystem. It has great maintainers like Thomas Gleixner who understand the mechanism deeply, and finding anything requires a deep understanding of futex itself. It took me months to dive into it and I sometimes just gave up.

LLM Technical summary

A stack use after free caused by a missing `READ_ONCE` in `requeue_pi_wake_futex` in the Linux kernel futex subsystem. The function signals the waiter via `futex_requeue_pi_complete(q, 1)` (atomic store) then reads `q->task` on the next line, but `q` is a `struct futex_q` on the waiter's kernel stack. After the atomic store the waiter can see the LOCKED state, skip PI fixup (because `q->pi_state` is NULL on the atomic trylock path), and return from its syscall before the requeuer reads `q->task`. The requeuer then dereferences a dead stack frame. The race is one instruction wide. Triggerable from unprivileged userspace using TLB shootdown IPIs via `mprotect` on a third CPU to interrupt the requeuer between the atomic store and the pointer load. The waiter's next syscall can spray controlled data at the old `q->task` offset. `wake_up_state` calls `try_to_wake_up` on the fake `task_struct` pointer, which under the right field layout reaches `enqueue_task_fair` and performs `cfs_rq->load.weight += se->load.weight`, an 8 byte addition through a controlled pointer with a controlled value. The fix adds `task = READ_ONCE(q->task)` before the `futex_requeue_pi_complete` store, capturing the pointer while the stack frame is still live.

The bug

Our post today focuses on a bug that is caused by a missing READ_ONCE in a function called requeue_pi_wake_futex. However, unlike classical use after free bugs that live on the heap, our bug is caused by an esoteric state on the stack.

Here is what happens during a pthread_cond_signal with a PI mutex. Two threads are involved, a waiter and a requeuer:

  waiter                                    requeuer
  ──────                                    ────────
  pthread_cond_wait(cond, mutex)
    futex(FUTEX_WAIT_REQUEUE_PI,
          cond, mutex)
      enqueue on cond's wait queue
      go to sleep, blocked in kernel
      ...                                   pthread_cond_signal(cond)
      ...                                     futex(FUTEX_CMP_REQUEUE_PI,
      ...                                           cond, mutex)
      ...                                       try to acquire mutex for waiter
      ...                                       if acquired:
      ...                                         requeue_pi_wake_futex(q)
      ...                                           signal waiter, wake it up
      wakes up, returns to userspace

The waiter is the thread that called pthread_cond_wait. It enters the kernel and goes to sleep on the condition variable’s futex. The requeuer is the thread that called pthread_cond_signal. It enters the kernel and tries to move the waiter onto the mutex. If the mutex is free the requeuer acquires it on behalf of the waiter and calls requeue_pi_wake_futex to let the waiter know.

The bug is in requeue_pi_wake_futex and it is hard to spot. The function is called by the requeuer when the PI lock was acquired atomically on behalf of the waiter; its job is to clean up the queue entry and wake the waiter. q is a variable on the waiter’s kernel stack. The bug occurs exactly when the requeuer signals the waiter that the lock is acquired and then reads from q on the next line, while the waiter has already seen the signal, returned from its syscall, and its stack frame is gone. Given all this information, can you spot the bug? :’)

/**
 * requeue_pi_wake_futex() - Wake a task that acquired the lock during requeue
 * @q:		the futex_q
 * @key:	the key of the requeue target futex
 * @hb:		the hash_bucket of the requeue target futex
 *
 * During futex_requeue, with requeue_pi=1, it is possible to acquire the
 * target futex if it is uncontended or via a lock steal.
 *
 * 1) Set @q::key to the requeue target futex key so the waiter can detect
 *    the wakeup on the right futex.
 *
 * 2) Dequeue @q from the hash bucket.
 *
 * 3) Set @q::rt_waiter to NULL so the woken up task can detect atomic lock
 *    acquisition.
 *
 * 4) Set the q->lock_ptr to the requeue target hb->lock for the case that
 *    the waiter has to fixup the pi state.
 *
 * 5) Complete the requeue state so the waiter can make progress. After
 *    this point the waiter task can return from the syscall immediately in
 *    case that the pi state does not have to be fixed up.
 *
 * 6) Wake the waiter task.
 *
 * Must be called with both q->lock_ptr and hb->lock held.
 */
static inline
void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
			   struct futex_hash_bucket *hb)
{
	q->key = *key;

	__futex_unqueue(q);

	WARN_ON(!q->rt_waiter);
	q->rt_waiter = NULL;

	q->lock_ptr = &hb->lock;

	/* Signal locked state to the waiter */
	futex_requeue_pi_complete(q, 1);
	wake_up_state(q->task, TASK_NORMAL);
}

If you did not find it, that’s OK. This isn’t a classical allocate, free, use bug pattern. There is no kmalloc here or kfree at all.

CPU1 (the requeuer) is running requeue_pi_wake_futex. CPU0 (the waiter) called futex_wait_requeue_pi and is blocked in the kernel waiting for someone to signal it, with struct futex_q q declared as a local variable on its kernel stack.

q->key = *key: the requeuer overwrites the waiter’s futex key to point at the requeue target (uaddr2) instead of the original condition variable (uaddr1). When the waiter eventually wakes up it checks this key to know which futex it was moved to.
__futex_unqueue(q): removes q from the hash bucket’s wait queue. After this no other futex_wake call can find this waiter. It’s the requeuer’s responsibility to wake it.
q->rt_waiter = NULL: clears the RT waiter pointer. The waiter checks this field when it wakes up; if it’s NULL the waiter knows the lock was acquired atomically and there’s no rt_mutex_waiter to clean up.
q->lock_ptr = &hb->lock: points the waiter’s lock pointer at the target hash bucket’s lock. If the waiter needs to do PI state fixup later it will take this lock to serialize with the requeuer. The waiter is still asleep through all of this. These four steps are safe.
futex_requeue_pi_complete(q, 1): an atomic store that sets q->requeue_state = Q_REQUEUE_PI_LOCKED. The waiter on CPU0 is spinning on that field with atomic_cond_read_relaxed. The moment it sees LOCKED it can proceed. The comment in the source says it plainly: “After this point the waiter task can return from the syscall immediately.”
wake_up_state(q->task, TASK_NORMAL): the requeuer reads q->task and wakes the task.

Now read step 5 and step 6 again :’) Step 5 tells the waiter “you have the lock” with an atomic store. The waiter on CPU0 is spinning on that field, and the moment it sees LOCKED it can return from its syscall. Step 6 reads q->task, but q lives on the WAITER’s kernel stack, and if the waiter has already returned, that stack frame is gone and the requeuer is reading from unknown memory. Boom.
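The ordering problem, and the shape of the fix described in the summary (snapshot q->task before the completing store), can be shown with C11 atomics. This is a single-threaded model with invented names: the waiter_returns callback stands in for the waiter winning the race and its next syscall overwriting the frame. It makes no attempt to reproduce the real concurrency.

```c
#include <stdatomic.h>
#include <stdint.h>

enum { MODEL_IN_PROGRESS = 0, MODEL_LOCKED = 1 };

struct model_futex_q {
    void *task;
    atomic_int requeue_state;
};

typedef void (*model_hook)(struct model_futex_q *q);

/* Stand-in for "waiter returned, next syscall reused the stack slot". */
static void model_clobber_frame(struct model_futex_q *q)
{
    q->task = (void *)(uintptr_t)0x4141414141414141ULL;
}

/* Buggy order: publish LOCKED first, then read q->task. */
static void *model_wake_racy(struct model_futex_q *q, model_hook waiter_returns)
{
    atomic_store_explicit(&q->requeue_state, MODEL_LOCKED,
                          memory_order_release);
    waiter_returns(q);      /* q is no longer ours after the store */
    return q->task;         /* reads whatever is in the slot now */
}

/* Fixed order: snapshot the pointer while the frame is still live. */
static void *model_wake_fixed(struct model_futex_q *q, model_hook waiter_returns)
{
    void *task = q->task;

    atomic_store_explicit(&q->requeue_state, MODEL_LOCKED,
                          memory_order_release);
    waiter_returns(q);
    return task;            /* unaffected by anything after the store */
}
```

The rule of thumb the model illustrates: once you publish the state that lets the other side go away, every field you still need must already be in a local.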

The fast return path

The previous section showed the requeuer’s side, what requeue_pi_wake_futex does, but this is only part of the magic. Why can the waiter return so fast that it beats the requeuer to step 6?

The race only works because the waiter can return without acquiring any locks. This happens when q->pi_state is NULL.

The requeuer calls futex_requeue_pi_prepare(top_waiter, NULL). The second argument is NULL because this is the atomic trylock path and no pi_state was created. When the waiter sees LOCKED:

case Q_REQUEUE_PI_LOCKED:
    if (q.pi_state && (q.pi_state->owner != current)) {
        spin_lock(q.lock_ptr);        // would serialize with requeuer
        ret = fixup_pi_owner(uaddr2, &q, true);
        put_pi_state(q.pi_state);
        spin_unlock(q.lock_ptr);
    }
    break;

pi_state is NULL. The entire block is skipped. The waiter cancels its timer and returns.

The race

So now we have a clear understanding of the bug itself in both the requeuer and the waiter, here is what actually happens with 2 CPUs.

CPU0 (waiter)                          CPU1 (requeuer)
──────────────                          ────────────────
futex_wait_requeue_pi():
  struct futex_q q;       ← ON STACK
  q.task = current;
  enqueue on uaddr1, sleep
                                        futex_requeue():
                                          lock both hash buckets
                                          futex_proxy_trylock_atomic()
                                            requeue_pi_prepare(q, NULL)
                                              q->pi_state = NULL
  *timeout fires*
  sees IN_PROGRESS → WAIT
  spins on requeue_state
                                            PI lock acquired (ret=1)
                                          requeue_pi_wake_futex(q):
                                            unqueue, clear rt_waiter
                                            requeue_pi_complete(q, 1)
                                              → atomic store: LOCKED
  sees LOCKED, exits spin             ←───┘
  pi_state == NULL → skip fixup
  hrtimer_cancel, return
  ← stack frame gone →                   wake_up_state(q->task, TASK_NORMAL)
                                                        ↑
                                          reading from a dead stack frame

On ARM64, the waiter’s spin loop uses atomic_cond_read_relaxed, which compiles to a WFE (Wait For Event) loop. When the requeuer’s atomic store triggers an exclusive monitor event, the waiter wakes within 1-3 cycles. Tight af, but the requeuer’s very next instruction is the ldr of q->task, which is also fast.

Winning the race

The race is one instruction wide, that is tight af. You need an interrupt on the requeuer’s CPU between the atomic store and the pointer load. That buys you 1-10 microseconds of delay, more than enough for the waiter to return and free the stack.

The standard trick from unprivileged userspace is a TLB shootdown IPI. A third thread on a third CPU calls mprotect on a shared mapping in a tight loop. Each mprotect sends an IPI to every CPU that has the page in its TLB. If the requeuer’s CPU has it cached, it takes the interrupt. Most calls miss the one-instruction window, but eventually you manage to win the race.

  waiter sleeping            waiter returned             spray
  ┌──────────────────┐       ┌──────────────────┐       ┌──────────────────┐
  │futex_wait_requeue│       │  (returned)       │       │                  │
  │                  │       │                   │       │                  │
  │ struct futex_q   │       │ struct futex_q    │       │ spray data       │
  │  q.task = current│ ───→  │  q.task = stale   │ ───→  │  fake task_struct*│
  │                  │       │                   │       │                  │
  │ rt_waiter        │       │ garbage           │       │ controlled       │
  │ timeout          │       │ garbage           │       │ controlled       │
  └──────────────────┘       └──────────────────┘       └──────────────────┘
  q is alive                 q is dead                   old q.task overwritten
  requeuer can read          data still in memory         requeuer reads fake ptr

  ─────────────────────────────────────────────────────────────────────────→
  requeuer: complete(q,1)    requeuer: INTERRUPTED        requeuer: wake_up_state(q->task)
                             ← interrupt window: 1-10μs →

However, when the interrupt lands, the dead stack frame still contains the original, valid q->task value (current, set by __futex_queue when the waiter first enqueued). If the requeuer reads that, wake_up_state just wakes the waiter again.

To actually do something useful, the waiter needs to overwrite its own dead stack frame before the requeuer reads it. The waiter returns from the syscall to userspace and immediately starts spraying. If the spray lands a fake task_struct * at the old q->task offset before the requeuer’s interrupt handler returns, you win (this kind of targeted stack spraying was shown to be practical by Lu et al., NDSS 2017).

The insight is that kernel stack data persists after a syscall returns: the next syscall on the same thread reuses the same physical stack pages. By profiling which syscalls write user-controlled data at which stack depths, you build a map of what you can control. Lu reported 91% coverage of the top 1KB with the right syscall selection; actual numbers can vary depending on the model you work on.

Do remember, you still need an infoleak.

try_to_wake_up internals

wake_up_state calls try_to_wake_up(p, TASK_NORMAL, 0) where p is now your fake pointer. Here’s what the kernel does with it (from kernel/sched/core.c):

int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
	guard(preempt)();
	int cpu, success = 0;

	if (p == current) {
		...
	}

	scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
		smp_mb__after_spinlock();
		if (!ttwu_state_match(p, state, &success))
			break;

		smp_rmb();
		if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
			break;

		WRITE_ONCE(p->__state, TASK_WAKING);

		smp_cond_load_acquire(&p->on_cpu, !VAL);

		cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
		...
		ttwu_queue(p, cpu, wake_flags);
	}  // ← scoped_guard: raw_spin_unlock_irqrestore(&p->pi_lock)
	...
}

What is interesting here is something I watched being committed but never used: the kernel finally has something close to scoped object lifetimes. scoped_guard is a lock guard introduced into the kernel by Peter Zijlstra in 2023. It uses GCC’s __attribute__((cleanup)) to automatically release a lock when execution leaves the block, just like RAII in C++ :’) who would have thought such things exist in the kernel. For raw_spinlock_irqsave it expands to:

// defined in include/linux/spinlock.h
DEFINE_LOCK_GUARD_1(raw_spinlock_irqsave, raw_spinlock_t,
		    raw_spin_lock_irqsave(_T->lock, _T->flags),
		    raw_spin_unlock_irqrestore(_T->lock, _T->flags),
		    unsigned long flags)

So scoped_guard(raw_spinlock_irqsave, &p->pi_lock) calls raw_spin_lock_irqsave(&p->pi_lock) at entry, and raw_spin_unlock_irqrestore(&p->pi_lock) at every exit. The unlock itself boils down to queued_spin_unlock:

// include/asm-generic/qspinlock.h
static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
	smp_store_release(&lock->locked, 0);
}
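The cleanup-attribute mechanism behind scoped_guard can be reproduced in user space. The following is a minimal sketch, not the kernel's macro machinery: toy_lock, TOY_GUARD and guarded_op are hypothetical names, and the "lock" is just an int flag. It shows that the release handler runs on every exit from the scope, including an early return.

```c
#include <assert.h>

/* Hypothetical stand-in for a raw spinlock: 1 = held, 0 = free. */
struct toy_lock {
	int locked;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(l->locked == 0); /* a real spinlock would spin here instead */
	l->locked = 1;
}

/* Cleanup handler: the compiler passes a pointer to the guarded variable. */
static void toy_guard_release(struct toy_lock **lp)
{
	(*lp)->locked = 0;
}

/* Declare a guard variable whose handler runs when the scope is left. */
#define TOY_GUARD(name, lockptr) \
	struct toy_lock *name __attribute__((cleanup(toy_guard_release))) = (lockptr); \
	toy_lock_acquire(name)

/* Returns 1 on the early-exit path and 0 on the full path.
 * The lock is released on both without an explicit unlock call. */
static int guarded_op(struct toy_lock *l, int bail_early, int *counter)
{
	TOY_GUARD(g, l);

	if (bail_early)
		return 1;	/* toy_guard_release fires here */

	(*counter)++;
	return 0;		/* and here */
}
```

This requires GCC or Clang, since __attribute__((cleanup)) is a compiler extension rather than standard C.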

Let’s go back to our primitive. p is under our control, so p->pi_lock must read as unlocked (zero) at entry, or raw_spin_lock spins forever. ttwu_state_match calls __task_state_match:

// kernel/sched/core.c
int __task_state_match(struct task_struct *p, unsigned int state)
{
	if (READ_ONCE(p->__state) & state)
		return 1;

	if (READ_ONCE(p->saved_state) & state)
		return -1;

	return 0;
}

state is TASK_NORMAL, which is TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE (0x3). So p->__state must have bit 0x1 or 0x2 set. If not, the match fails, the scoped_guard breaks out early, the unlock fires, and raw_spin_unlock_irqrestore writes zero to the locked byte of p->pi_lock.
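The gate can be modelled in user space to see which field values pass it. This is a sketch based only on the excerpt above: struct toy_task is a hypothetical two-field stand-in, not the real task_struct layout, and toy_task_state_match mirrors the three return values of the excerpt.

```c
#include <assert.h>

#define TASK_INTERRUPTIBLE	0x1u
#define TASK_UNINTERRUPTIBLE	0x2u
#define TASK_NORMAL		(TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

/* Only the two fields the gate reads; NOT the real task_struct layout. */
struct toy_task {
	unsigned int state;       /* models p->__state */
	unsigned int saved_state; /* models p->saved_state */
};

/* Same decision logic as the excerpt: 1 = __state matches,
 * -1 = only saved_state matches, 0 = no match. */
static int toy_task_state_match(const struct toy_task *p, unsigned int state)
{
	assert(state != 0); /* an empty mask could never match */

	if (p->state & state)
		return 1;

	if (p->saved_state & state)
		return -1;

	return 0;
}
```

For example, a state of 0x2 passes against TASK_NORMAL, while 0x200 (TASK_WAKING) does not share any bit with 0x3 and fails.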

If both gates pass (the state matches and the on_rq check does not break out), the deeper path writes TASK_WAKING (0x200) to p->__state, reads on_cpu and wake_cpu, and calls ttwu_queue, which eventually calls ttwu_do_wakeup, writing TASK_RUNNING (0) to p->__state.

Eventually, scoped_guard exits and performs the unlock with smp_store_release. Unlike mutex_unlock, which is not a single atomic operation, the spinlock unlock is one atomic store, so it gives us just a write of zero. The careful reader will notice that the real primitive is not the unlock itself but what happens before it. If both gates pass, the deeper path reaches ttwu_queue → ttwu_do_activate → activate_task → enqueue_task:

// kernel/sched/core.c
static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
	...
	p->sched_class->enqueue_task(rq, p, flags);
}

With p->sched_class pointing at the real fair_sched_class (its address is known once KASLR is defeated), this calls enqueue_task_fair. Inside, se = &p->se is our embedded sched_entity, and cfs_rq = se->cfs_rq is a pointer we control:

// kernel/sched/fair.c
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	struct sched_entity *se = &p->se;
	...
	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);           // returns se->cfs_rq
		enqueue_entity(cfs_rq, se, flags);
		...
	}
}

enqueue_entity calls account_entity_enqueue:

// kernel/sched/fair.c
static void account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	update_load_add(&cfs_rq->load, se->load.weight);
	...
	cfs_rq->nr_running++;
}

static inline void update_load_add(struct load_weight *lw, unsigned long inc)
{
	lw->weight += inc;
	lw->inv_weight = 0;
}

The net effect is cfs_rq->load.weight += se->load.weight: an 8-byte controlled addition through a pointer we control, with a value (se->load.weight) we fully control in our sprayed fake task_struct.
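A user-space model of update_load_add makes the side effects of this addition visible. This is a sketch under stated assumptions: struct toy_load_weight uses fixed-width types to mimic a 64-bit kernel (where unsigned long is 8 bytes), and toy_delta_for is a hypothetical helper, not something from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Models struct load_weight on a 64-bit kernel (assumed field widths). */
struct toy_load_weight {
	uint64_t weight;
	uint32_t inv_weight;
};

/* Same two statements as the update_load_add excerpt. */
static void toy_update_load_add(struct toy_load_weight *lw, uint64_t inc)
{
	lw->weight += inc;
	lw->inv_weight = 0; /* side effect: 4 bytes after weight are zeroed */
}

/* Hypothetical helper: the increment that turns `old` into `want`.
 * Unsigned arithmetic wraps modulo 2^64, so this also covers want < old. */
static uint64_t toy_delta_for(uint64_t old, uint64_t want)
{
	uint64_t delta = want - old;

	assert((uint64_t)(old + delta) == want);
	return delta;
}
```

Note that the result depends on the old value at the target, and that the store to inv_weight always clobbers the neighbouring field. In the real path, cfs_rq->nr_running++ also fires, so the write is never a clean 8 bytes.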

I did not turn this into arbitrary read / write primitives; I only got as far as the write. I believe that with the right effort it is doable, but I might be wrong as I’m just a pleb.

The fix

 void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key,
                            struct futex_hash_bucket *hb)
 {
+    struct task_struct *task;
+
     q->key = *key;
     __futex_unqueue(q);
     WARN_ON(!q->rt_waiter);
     q->rt_waiter = NULL;
     q->lock_ptr = &hb->lock;
+    task = READ_ONCE(q->task);

     futex_requeue_pi_complete(q, 1);
-    wake_up_state(q->task, TASK_NORMAL);
+    wake_up_state(task, TASK_NORMAL);
 }

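The fix is pretty simple, but spotting the bug requires deep understanding. q->task is read into a local with READ_ONCE before futex_requeue_pi_complete, so the requeuer never dereferences q after the waiter has been told it may return and pop its stack frame.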

Conclusions

This is an unusual bug because it works on stack frames rather than heap objects; we don’t see those often, if at all. Nobody calls kfree. The waiter just returns from a function and the data ceases to be meaningful: the alloc was a local variable, the “free” was a ret instruction :’) You can’t spray a live thread’s kernel stack the way you spray a slab freelist. You have to wait for the function to return and then immediately spike the kernel with a syscall that writes controlled data at the right depth.

The race window is a single instruction wide, which makes it non-trivial to trigger and exploit; however, with careful planning it is doable. The only issue that remains is an infoleak, which I do not think this bug can provide. Obtaining one is left to the reader.

I found this bug pretty interesting and insightful, and I hope you do too.

The epoll uaf

A couple of weeks ago Nicholas Carlini burned an epoll uaf race in fs/eventpoll.c. Commit 07712db80857 changed a kfree() to kfree_rcu(). The commit message says: “eventpoll: defer struct eventpoll free to RCU grace period.”

That one call fixed a uaf that had been reachable from any unprivileged process for a few years on any Linux / Android device running a 6.6 or later kernel with the affected optimization. This post is the story of the bug itself, what it gives you, and my (failed) attempts at exploiting it on a real modern device.

LLM Technical summary

A use after free in the epoll graph walker in `fs/eventpoll.c`. A 2023 optimization (removing the global `epmutex`) left the RCU read side walkers `ep_get_upwards_depth_proc` and `reverse_path_check_proc` racing against `ep_free`, which frees `struct eventpoll` with `kfree()` (no RCU grace period). The walker follows `epi->ep` into a freed `eventpoll` in `kmalloc-256`. The walker writes `loop_check_gen` (u64 at offset 168) and `loop_check_depth` (u8 at offset 184) into the freed object, giving a constrained write primitive. Discovered by Nicholas Carlini. Fixed in commit `07712db80857` by changing `kfree(ep)` to `kfree_rcu(ep, rcu)`. Affects Linux 6.6+ kernels with the `epmutex` removal optimization (March 2023). Triggerable from any unprivileged process via nested epoll instances and same CPU preemption under `CONFIG_PREEMPT=y` and `CONFIG_PREEMPT_RCU=y`. Tested on Pixel 10 (Frankel). Cross cache exploitation to PTE pages was not achieved due to timing constraints between the 2ms race window and the 100ms slab to PTE transition pipeline. Same cache reclaim in `kmalloc-256` is straightforward via LIFO SLUB freelist.

I spent a while with a Pixel 10 working on this bug, and in the process learned more about CFS vruntime tricks, SLUB internals, and the ARM64 memory model than I probably needed to.

epoll in 2 seconds

If you’ve run a Linux server, you’ve used epoll indirectly. It’s the kernel’s scalable I/O notification mechanism, the thing that lets nginx watch tens of thousands of sockets without blocking a thread per connection. Three syscalls: epoll_create() makes an instance, epoll_ctl() adds or removes watched file descriptors, epoll_wait() blocks until something happens.

Linux treats everything as a file, so an epoll instance is itself a file descriptor. You can add an epoll to another epoll. This creates a directed graph of instances watching instances, and the kernel has validation code inside epoll_ctl(ADD) that walks this graph to check for cycles and depth violations. That validation code is where the bug lives.

epoll has a history of CVEs; however, documentation of their exploitation is very scarce.

Structures

epoll data structures and the UAF

struct eventpoll: one per epoll_create(). Has the wait queue, the RB tree of items being watched, and refs at offset 176: an hlist head that links every epitem pointing back at this instance from somewhere else. It’s the incoming-edges list in the graph.

struct epitem: one per (epoll instance, watched fd) pair. Has epi->ep, a pointer to its owning eventpoll. If the watched fd is itself an epoll, this epitem is also linked into that epoll’s refs hlist via fllink.

The graph walker iterates ep->refs, follows epi->ep for each entry to reach a parent eventpoll, and recurses. That epi->ep dereference is the UAF.
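The walk described above can be modelled in user space. This is a minimal sketch, not the kernel code: struct toy_ep replaces the refs hlist and the epitem indirection with a plain array of parent pointers, and all toy_ names are hypothetical. It keeps the same memoization scheme, where an instance whose gen equals the global generation is not walked again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_REFS 4

/* Simplified stand-in for struct eventpoll: the array holds the parents
 * that watch this instance (the incoming edges). */
struct toy_ep {
	uint64_t gen;
	uint8_t loop_check_depth;
	size_t nrefs;
	struct toy_ep *refs[TOY_MAX_REFS];
};

/* Global generation counter; bumped once per check in the real code. */
static uint64_t toy_loop_check_gen = 1;

/* Longest chain of parents above ep. Assumes the graph is acyclic. */
static int toy_upwards_depth(struct toy_ep *ep)
{
	int result = 0;

	assert(ep->nrefs <= TOY_MAX_REFS);

	if (ep->gen == toy_loop_check_gen)
		return ep->loop_check_depth; /* already visited this round */

	for (size_t i = 0; i < ep->nrefs; i++) {
		int d = toy_upwards_depth(ep->refs[i]) + 1;
		if (d > result)
			result = d;
	}

	/* The two stores every visited instance receives. */
	ep->gen = toy_loop_check_gen;
	ep->loop_check_depth = (uint8_t)result;
	return result;
}
```

Every instance reached through a parent pointer is both read and written, which is why a stale parent pointer matters so much in the real walker.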

The 2023 Optimization

Before March 2023, every epoll_ctl(ADD) with a nested target acquired a global mutex called epmutex. Under HTTP benchmarks, 58% of CPU time was lost to contention on it.

A patch replaced epmutex with a per-instance refcount_t, added a dying flag to struct epitem, and narrowed the remaining lock to only be held during actual graph walks. Throughput went up 60%.

The race happens in the graph walkers ep_get_upwards_depth_proc and reverse_path_check_proc. Both functions iterate ep->refs under rcu_read_lock() while other threads tear down the structures they’re pointing at. The old epmutex had been incidentally serializing this; the new scheme was too permissive, and nobody noticed that the walkers race. The reason is that the walkers don’t touch any of the data the mutex was nominally protecting; they were only reading data.

The Bug

static int ep_loop_check(struct eventpoll *ep, struct eventpoll *to)
{
	int depth, upwards_depth;

	inserting_into = ep;
	/*
	 * Check how deep down we can get from @to, and whether it is possible
	 * to loop up to @ep.
	 */
	depth = ep_loop_check_proc(to, 0);
	if (depth > EP_MAX_NESTS)
		return -1;
	/* Check how far up we can go from @ep. */
	rcu_read_lock();
	upwards_depth = ep_get_upwards_depth_proc(ep, 0);
	rcu_read_unlock();

	return (depth+1+upwards_depth > EP_MAX_NESTS) ? -1 : 0;
}

..
snip
..


static int ep_get_upwards_depth_proc(struct eventpoll *ep, int depth)
{
    int result = 0;
    struct epitem *epi;

    if (ep->gen == loop_check_gen)
        return ep->loop_check_depth;

    hlist_for_each_entry_rcu(epi, &ep->refs, fllink)
        result = max(result, ep_get_upwards_depth_proc(epi->ep, depth + 1) + 1);
    ep->gen = loop_check_gen;
    ep->loop_check_depth = result;
    return result;
}

ep_get_upwards_depth_proc runs under rcu_read_lock(). Each epitem is safe: when it is unlinked, it’s freed via call_rcu(), so RCU keeps it alive through the read-side critical section. There’s even a comment in the source that acknowledges the RCU reader:

/* The rcu read side, reverse_path_check_proc(), does not make
 * use of the rbn field.
 */
call_rcu(&epi->rcu, epi_rcu_free);

That comment is correct about the epitem. It says nothing about what epi->ep points to.

Now look at the teardown path:

static void ep_free(struct eventpoll *ep)
{
    mutex_destroy(&ep->mtx);
    free_uid(ep->user);
    wakeup_source_unregister(ep->ws);
    kfree(ep);
}

kfree(). Immediate. No RCU grace period.

The walker loads epi->ep (a pointer read), then dereferences the target, but that eventpoll may have already been freed and reused by a completely different kmalloc-256 allocation.

Triggering it

The race timeline

I initially tried two threads on different CPUs, one walking the graph and one closing an epoll fd. It didn’t work: the window between loading epi from the hlist and following epi->ep is a handful of ARM64 instructions. What does work is same-CPU preemption. The Frankel device I was testing on runs CONFIG_PREEMPT=y and CONFIG_PREEMPT_RCU=y, which means rcu_read_lock() just bumps a per-task counter; it doesn’t disable preemption. A timer tick during the walk can yield the CPU to the closer thread even though the walker is mid-RCU.

Just to give you a sense of the numbers (CONFIG_HZ=250, tick every 4 ms):

4,096 parents: walk takes ~400 us. Rarely overlaps a tick.
8,000 parents: ~2 ms. Overlaps reliably. About 4% hit rate per attempt.

If the closer thread busy-waits for the trigger signal, the scheduler treats it as the same priority as the walker and never switches. If instead the closer calls usleep(1000) in a loop while waiting, it works: sleeping threads get scheduling priority when they wake, and the scheduler preempts the walker immediately.

The Pixel’s default governor throttles to 729 MHz at idle. At that frequency the traversal timing shifts enough that the race stops firing entirely :’)

What Gets Written

struct eventpoll {
    struct mutex               mtx;                  /*     0    48 */
    wait_queue_head_t          wq;                   /*    48    24 */
    wait_queue_head_t          poll_wait;            /*    72    24 */
    struct list_head           rdllist;              /*    96    16 */
    rwlock_t                   lock;                 /*   112     8 */
    struct rb_root_cached      rbr;                  /*   120    16 */
    struct epitem *            ovflist;              /*   136     8 */
    struct wakeup_source *     ws;                   /*   144     8 */
    struct user_struct *       user;                 /*   152     8 */
    struct file *              file;                 /*   160     8 */
    u64                        gen;                  /*   168     8 */ /* read, then WRITE loop_check_gen */
    struct hlist_head          refs;                 /*   176     8 */ /* READ as hlist pointer           */
    u8                         loop_check_depth;     /*   184     1 */ /* WRITE computed depth (0 if refs is empty) */
    refcount_t                 refcount;             /*   188     4 */
    unsigned int               napi_id;              /*   192     4 */

    /* size: 200, cachelines: 4, members: 15 */
};

struct eventpoll lives in kmalloc-256 (order-1 slabs, 32 objects per slab, cpu_partial=52 on this device). init_on_free=1 is set by default on Frankel devices, and Android adds custom padding at the end of each object, so the structure differs a bit from mainline Linux.
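The 32-objects-per-slab figure follows from simple arithmetic, sketched below. The objs_per_slab helper is hypothetical and deliberately ignores any per-slab metadata or debug padding, so treat it as a back-of-the-envelope check rather than what SLUB actually computes.

```c
#include <assert.h>

/* Number of fixed-size objects that fit in a slab of 2^order pages.
 * Simplification: no per-slab metadata, no redzones, no alignment slack. */
static unsigned long objs_per_slab(unsigned int order,
				   unsigned long page_size,
				   unsigned long obj_size)
{
	assert(obj_size > 0);
	return ((1ul << order) * page_size) / obj_size;
}
```

With 4 KB pages, an order-1 slab is 8192 bytes, and 8192 / 256 gives the 32 objects quoted above.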

The traversal reads refs.first at offset 176, so that is our target offset. It was central to my main attempt to exploit this as a one-shot without any infoleaks:

If it’s zero (the init_on_free case), the hlist looks empty. The walker skips the loop, writes loop_check_gen at 168 and a zero byte at 184, returns. Silent corruption of 9 bytes in whatever object gets reused.

If it’s nonzero, the walker follows it as a pointer to an epitem, computes container_of(), dereferences epi->ep, and recurses into wherever that points. This is where the write becomes steerable: it lands at fixed offsets from whatever address the recursion reaches.

If you can reclaim the slot with an object where you control offset 176, you steer the recursion. Each level writes loop_check_gen (a global u64 counter that increments per epoll_ctl(ADD)) and a zero byte at fixed offsets from the pointer target. That’s a constrained write primitive. What you do with it from there depends on what kmalloc-256 object you use for reclaim, and how creative you’re feeling.
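The zero case can be checked with a small user-space model that treats the reused slot as a raw 256-byte buffer. This is a sketch under stated assumptions: the offsets come from the layout listing above, while toy_walker_visit and toy_count_diff are hypothetical helpers. The nonzero case (following refs.first) is deliberately not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets taken from the struct eventpoll listing above. */
enum { OFF_GEN = 168, OFF_REFS = 176, OFF_DEPTH = 184, SLOT_SIZE = 256 };

/* Model of one walker visit to a slot.
 * Returns 1 if it wrote, 0 on a memo hit (gen already current),
 * -1 if refs.first is nonzero (recursion, not modelled here). */
static int toy_walker_visit(unsigned char *slot, uint64_t loop_check_gen)
{
	uint64_t gen, first;

	memcpy(&gen, slot + OFF_GEN, sizeof(gen));
	if (gen == loop_check_gen)
		return 0;

	memcpy(&first, slot + OFF_REFS, sizeof(first));
	if (first != 0)
		return -1;

	/* Empty hlist: result stays 0, then the two stores happen. */
	memcpy(slot + OFF_GEN, &loop_check_gen, sizeof(loop_check_gen));
	slot[OFF_DEPTH] = 0;
	return 1;
}

/* Count bytes that differ between two buffers of length n. */
static size_t toy_count_diff(const unsigned char *a, const unsigned char *b,
			     size_t n)
{
	size_t diff = 0;

	assert(a != NULL && b != NULL);
	for (size_t i = 0; i < n; i++)
		diff += (a[i] != b[i]);
	return diff;
}
```

At most nine bytes change (eight for gen, one for the depth byte), which is where the "9 bytes" figure above comes from. Fewer differ if some bytes already held the value being written.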

Note: There are other paths I did not include in this blog. One of them leads to mutex_unlock, if you are careful and brave enough to walk into it. They require tremendous memory pressure, and some of them might be fruitful. Trivia: we also control gen and loop_check_depth, which allows zeroing out (or writing, somewhat deterministically yet very slowly) a controlled value in the freed chunk.

Can You Cross Cache This?

I wanted to exploit this vuln as a one-shot primitive using PTE corruption. My attempts failed, but this was my strategy. If I had an infoleak, I’d use a different primitive and then solve everything pretty easily with refs.first as a pointer. Note: This part is technical. If you are not familiar with PCPs, Page Table Entries or SLUB / Buddy internals, I encourage you to read about them before you try reading this part.

The freed object goes back into kmalloc-256, which uses order-1 slabs. ARM64 PTE pages are order-0 (4 KB). These sit on different PCP freelists. The order-1 page freed from the slab cache won’t satisfy an order-0 PTE request unless the PCP overflows and buddy splits it. Arranging that overflow during the narrow race window turned out to be non-trivial. It was possible to perform the split without invoking the race; however, integrating both pieces never succeeded.

These pieces work separately. After shaping, 244 out of 250 slab pages go to buddy, and with 16 children forking and faulting 8 GB each, all available UNMOVABLE order-1 pages get split for PTE allocations. The slab-to-buddy transition works and the buddy-to-PTE transition works; the problem is combining them with the race. The walker finishes in about 2 ms. The full cross-cache pipeline (SLUB discard, PCP drain, buddy insertion, PTE allocation with __GFP_ZERO) takes on the order of 100 ms. The gen write needs to land on a physical page that has already completed the transition from slab to PTE, and those timelines don’t overlap. I couldn’t find a way to stretch the walk long enough without resorting to SCHED_FIFO or similar privileged tricks, which defeats the purpose.

Same-cache reclaim sidesteps this entirely. SLUB’s per-CPU freelist is LIFO: last freed, first allocated. An immediate kmalloc(256) on the same CPU gets you the exact slot. The hard part is finding a kmalloc-256 object with a useful layout at offsets 168 and 176. I did not invest too much time in this.

The Fix

Commit 07712db80857:

 static void ep_free(struct eventpoll *ep)
 {
     mutex_destroy(&ep->mtx);
     free_uid(ep->user);
     wakeup_source_unregister(ep->ws);
-    kfree(ep);
+    kfree_rcu(ep, rcu);
 }

The fix adds a struct rcu_head to eventpoll. kfree_rcu() defers the free until the RCU grace period ends. Since the walker holds rcu_read_lock(), the grace period can’t complete until it’s done.

Closing Thoughts

What stays with me about this bug isn’t the race condition or the allocator internals. It’s how much work it takes to understand which code paths in epoll are protected by what. Wait queue locks serialize callbacks. File refcounts gate ep_free. __fput sequences cleanup. call_rcu defers epitem frees. Each mechanism covers something. You have to hold all of them in your head at once before you can point at epi->ep and be sure that nothing is keeping the target alive. I spent several days just on that part.

I encourage anyone to try to exploit this on a modern Android system. It sounds fun, and I’d be interested to see how you manage to get stable arbitrary read and write primitives out of this bug.