
CVE-2021-20226: a reference counting bug which leads to local privilege escalation in io_uring

source link: https://flattsecurity.medium.com/cve-2021-20226-a-reference-counting-bug-which-leads-to-local-privilege-escalation-in-io-uring-e946bd69177a

What is io_uring

Rough explanation

Roughly speaking, io_uring is Linux's newest asynchronous I/O (network/filesystem) mechanism.

Please refer to the blogs/slides posted on the Internet for the specification and detailed descriptions from the user's perspective.
From here on, I will outline io_uring on the assumption that you understand the basics.

In io_uring, a file descriptor is first created by a dedicated system call (io_uring_setup); by calling mmap() on it, the Submission Queue (SQ) and Completion Queue (CQ) are mapped into userspace memory and shared with the kernel.
Both sides (kernel/userspace) use these as ring buffers.
Entries for system calls such as read/write/send/recv are registered by writing an SQE (Submission Queue Entry) into the shared memory.
Execution is then started by calling io_uring_enter().

Asynchronous execution

By the way, the important part this time is the implementation of asynchronous execution, so I will focus on that.
To state it up front: io_uring does not always execute operations asynchronously; it switches to asynchronous execution only as needed.
Please refer to the code below first. (From here on, Kernel v5.8 is used to explain the behavior. The behavior may be slightly different in your environment.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/fcntl.h>
#include <sys/uio.h>
#include <time.h>
#include <err.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/io_uring.h>

#define SYSCHK(x) ({ \
    typeof(x) __res = (x); \
    if (__res == (typeof(x))-1) \
        err(1, "SYSCHK(" #x ")"); \
    __res; \
})

static int uring_fd;
struct iovec *io;
#define SIZE 32
char _buf[SIZE];

int main(void) {
    // initialize uring
    struct io_uring_params params = { };
    uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/10, &params));
    unsigned char *sq_ring = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                                         MAP_SHARED, uring_fd,
                                         IORING_OFF_SQ_RING));
    unsigned char *cq_ring = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                                         MAP_SHARED, uring_fd,
                                         IORING_OFF_CQ_RING));
    struct io_uring_sqe *sqes = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE,
                                            MAP_SHARED, uring_fd,
                                            IORING_OFF_SQES));

    io = malloc(sizeof(struct iovec) * 1);
    io[0].iov_base = _buf;
    io[0].iov_len = SIZE;

    struct timespec ts = { .tv_sec = 1 };
    sqes[0] = (struct io_uring_sqe) {
        .opcode = IORING_OP_TIMEOUT,
        //.flags = IOSQE_IO_HARDLINK,
        .len = 1,
        .addr = (unsigned long)&ts
    };
    sqes[1] = (struct io_uring_sqe) {
        .opcode = IORING_OP_READV,
        .addr = (unsigned long)io,
        .flags = 0,
        .len = 1,
        .off = 0,
        .fd = SYSCHK(open("/etc/passwd", O_RDONLY))
    };
    ((int*)(sq_ring + params.sq_off.array))[0] = 0;
    ((int*)(sq_ring + params.sq_off.array))[1] = 1;
    (*(int*)(sq_ring + params.sq_off.tail)) += 2;

    int submitted = SYSCHK(syscall(__NR_io_uring_enter, uring_fd,
                                   /*to_submit=*/2, /*min_complete=*/0,
                                   /*flags=*/0, /*sig=*/NULL, /*sigsz=*/0));
    while (1) {
        usleep(100000);
        if (*_buf) {
            puts("READV executed.");
            break;
        }
        puts("Waiting.");
    }
}

In this code, after performing the necessary setup for the IORING_OP_TIMEOUT and IORING_OP_READV operations, it starts execution and then checks every 0.1 seconds whether readv() has completed.
One might expect readv() to complete only after 1 second, given that operations are taken from the ring buffer in order. However, when I actually ran it, the result was as follows.

$ ./sample
READV executed.

That is, the execution of readv() completed immediately.
This is because, as I said earlier, operations are executed asynchronously only as needed. In this case the readv() could complete immediately (it is known that its execution will not block), so this later operation completed first (IORING_OP_TIMEOUT was ignored for the time being).
As a test, let's confirm that readv() is executed synchronously (i.e., in the system call handler) with the following SystemTap[¹] script.

[¹]: A tool that lets you flexibly run scripts that trace kernel (and other) functions and print variables at the traced points. I love this tool because kernel debugging is a hassle.

#!/usr/bin/stap
probe kernel.function("io_read@/build/linux-b4NE0x/linux-5.8.0/fs/io_uring.c:2710") {
    printf("%s\n", task_execname(task_current()))
}

↓ This is the output when the previous program (the file is named sample) is run while the above SystemTap script is running. If execution were asynchronous, we would expect the task to be registered with some worker; since it is executed synchronously here, the name of the executable that issued the system call is printed.

$ sudo stap -g ./sample.stp
sample

So where did IORING_OP_TIMEOUT go? The answer: it was handed off to a kernel thread because it was determined to require asynchronous execution. There are several criteria for this; if a request meets one of them, it is enqueued into the queue for asynchronous execution. Here are some examples.

1. When the force async flag is enabled

	} else if (req->flags & REQ_F_FORCE_ASYNC) {
		......
		/*
		 * Never try inline submit of IOSQE_ASYNC is set, go straight
		 * to async execution.
		 */
		req->work.flags |= IO_WQ_WORK_CONCURRENT;
		io_queue_async_work(req);

https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4825

2. Decisions made by logic prepared for each operation (e.g., readv() is called with the IOCB_NOWAIT flag and returns -EAGAIN if execution is expected to block).

static int io_read(struct io_kiocb *req, struct io_kiocb **nxt,
		   bool force_nonblock)
{
	......
	ret = rw_verify_area(READ, req->file, &kiocb->ki_pos, iov_count);
	if (!ret) {
		ssize_t ret2;

		if (req->file->f_op->read_iter)
			ret2 = call_read_iter(req->file, kiocb, &iter);
		else
			ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);

		/* Catch -EAGAIN return for forced non-blocking submission */
		if (!force_nonblock || ret2 != -EAGAIN) {
			kiocb_done(kiocb, ret2, nxt, req->in_async);
		} else {
copy_iov:
			ret = io_setup_async_rw(req, io_size, iovec,
						inline_vecs, &iter);
			if (ret)
				goto out_free;
			return -EAGAIN;
		}
	}
	......
}

https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L2224

When -EAGAIN is returned, the request is enqueued into the queue for asynchronous execution (and, for operation types that use file descriptors, a reference to the file structure is taken here).

static void __io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
	......
	ret = io_issue_sqe(req, sqe, &nxt, true);

	/*
	 * We async punt it if the file wasn't marked NOWAIT, or if the file
	 * doesn't support non-blocking read/write attempts
	 */
	if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
	    (req->flags & REQ_F_MUST_PUNT))) {
punt:
		if (io_op_defs[req->opcode].file_table) {
			ret = io_grab_files(req);
			if (ret)
				goto err;
		}

		/*
		 * Queued up for async execution, worker will release
		 * submit reference when the iocb is actually submitted.
		 */
		io_queue_async_work(req);
		goto done_req;
	}
	......
}

https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4741

static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
			struct io_kiocb **nxt, bool force_nonblock)
{
	struct io_ring_ctx *ctx = req->ctx;
	int ret;

	switch (req->opcode) {
	case IORING_OP_NOP:
		ret = io_nop(req);
		break;
	case IORING_OP_READV:
	case IORING_OP_READ_FIXED:
	case IORING_OP_READ:
		if (sqe) {
			ret = io_read_prep(req, sqe, force_nonblock);
			if (ret < 0)
				break;
		}
		ret = io_read(req, nxt, force_nonblock);
		break;

https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4314

3. When the IOSQE_IO_LINK|IOSQE_IO_HARDLINK flag is used (i.e., an execution order is specified) and an earlier operation in the chain is determined to require asynchronous execution.

(As the code below shows, the requests are connected as a link and executed in order; if condition 2 is met partway through, the whole link is enqueued into the asynchronous execution queue.)

static bool io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
			  struct io_submit_state *state, struct io_kiocb **link)
{
	......
	/*
	 * If we already have a head request, queue this one for async
	 * submittal once the head completes. If we don't have a head but
	 * IOSQE_IO_LINK is set in the sqe, start a new head. This one will be
	 * submitted sync once the chain is complete. If none of those
	 * conditions are true (normal request), then just queue it.
	 */
	if (*link) {
		......
		list_add_tail(&req->link_list, &head->link_list);

		/* last request of a link, enqueue the link */
		if (!(sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK))) {
			io_queue_link_head(head);
			*link = NULL;
		}
	} else {
		......
		if (sqe_flags & (IOSQE_IO_LINK|IOSQE_IO_HARDLINK)) {
			req->flags |= REQ_F_LINK;
			INIT_LIST_HEAD(&req->link_list);

			if (io_alloc_async_ctx(req)) {
				ret = -EAGAIN;
				goto err_req;
			}
			ret = io_req_defer_prep(req, sqe);
			if (ret)
				req->flags |= REQ_F_FAIL_LINK;
			*link = req;
		} else {
			io_queue_sqe(req, sqe);
		}
	}

	return true;
}

https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4858

Strictly speaking, IORING_OP_TIMEOUT is a little special and does not return -EAGAIN as shown in 2, but (I think) it is easy to understand, so I use it as the example.
As shown below, by linking an operation that requires asynchronous execution (IORING_OP_TIMEOUT) to another operation, you can see that the subsequent IORING_OP_READV now certainly executes only after the 1-second wait.

Add the IOSQE_IO_HARDLINK flag to the IORING_OP_TIMEOUT operation in the sample code above to make it explicitly linked to the subsequent operation.

48c48
< //.flags = IOSQE_IO_HARDLINK,
---
> .flags = IOSQE_IO_HARDLINK,

Execution result

$ ./sample
Waiting.
Waiting.
Waiting.
Waiting.
Waiting.
Waiting.
Waiting.
Waiting.
Waiting.
READV executed.

At this time, if you print the name of the process executing io_read() in the same way as before, you get the following output.

$ sudo stap -g ./sample.stp
io_wqe_worker-0

As you can see by looking at the process list, this is a Kernel Thread.

$ ps aux | grep -A 2 -m 1 sample
garyo 131388 0.0 0.0 2492 1412 pts/1 S+ 19:03 0:00 ./sample
root 131389 0.0 0.0 0 0 ? S 19:03 0:00 [io_wq_manager]
root 131390 0.0 0.0 0 0 ? S 19:03 0:00 [io_wqe_worker-0]

Hereafter, this kernel thread will be referred to as a "worker". A worker is created by the following code and then dequeues and executes asynchronous tasks from the queue.

static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
{
	......
	worker->task = kthread_create_on_node(io_wqe_worker, worker, wqe->node,
				"io_wqe_worker-%d/%d", index, wqe->node);
	......
}

https://elixir.bootlin.com/linux/v5.6.19/source/fs/io-wq.c#L621

Aside: As explained earlier, IORING_OP_TIMEOUT actually behaves slightly differently from the figure below, which is simplified. Strictly speaking, when io_timeout() is called, it registers io_timeout_fn() as the handler and starts a timer. When the timer expires, io_timeout_fn() is called and loads the operations linked to it into the asynchronous execution queue. In other words, IORING_OP_TIMEOUT itself is never enqueued in the asynchronous execution queue. TIMEOUT is used in the explanation only because it makes it easy to imagine execution stopping.

[Figure: simplified diagram of IORING_OP_TIMEOUT and the asynchronous execution queue]

Precautions when offloading I/O operations to the Kernel

We have seen that asynchronous processing is performed by a worker running as a kernel thread. However, there is a caveat here. Since the worker runs as a kernel thread, its execution context differs from that of the thread which called the io_uring system calls.
Here, "execution context" means the task_struct structure associated with the process and the various information attached to it.
For example: mm (manages the process's virtual memory space), cred (holds UID/GID/capabilities), files_struct (holds the file descriptor table; it contains an array of file structures, and a file descriptor is an index into that array), and so on.

Naturally, if the worker does not use these structures from the thread that called the system call, it may refer to the wrong virtual memory or the wrong file descriptor table, or issue I/O operations with kernel-thread privileges (≒ root)[²].

[²]: By the way, this was an actual vulnerability: at the time, the cred switch was forgotten, so operations were executed with root privileges. Although an operation equivalent to open() was not implemented at that time, sendmsg()'s SCM_CREDENTIALS option, which tells the receiver the sender's credentials, could be used to pass root's credentials. This is a problem for things like D-Bus, which verify authority that way. https://www.exploit-db.com/exploits/47779

Therefore, in io_uring, these references are passed along to the worker, and the worker switches its own context before execution so that it shares the caller's execution context. For example, you can see that references to mm and cred are stored in req->work in the following code.

static inline void io_req_work_grab_env(struct io_kiocb *req,
const struct io_op_def *def)
{
if (!req->work.mm && def->needs_mm) {
mmgrab(current->mm);
req->work.mm = current->mm;
}
if (!req->work.creds)
req->work.creds = get_current_cred();
if (!req->work.fs && def->needs_fs) {
spin_lock(&current->fs->lock);
if (!current->fs->in_exec) {
req->work.fs = current->fs;
req->work.fs->users++;
} else {
req->work.flags |= IO_WQ_WORK_CANCEL;
}
spin_unlock(&current->fs->lock);
}
if (!req->work.task_pid)
req->work.task_pid = task_pid_vnr(current);
}

https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L910

You can see that the reference to files_struct is passed to the req->work in the following code.

static int io_grab_files(struct io_kiocb *req)
{
......
if (fcheck(ctx->ring_fd) == ctx->ring_file) {
list_add(&req->inflight_entry, &ctx->inflight_list);
req->flags |= REQ_F_INFLIGHT;
req->work.files = current->files;
ret = 0;
}
......
}

https://elixir.bootlin.com/linux/v5.6.19/source/fs/io_uring.c#L4634

Then, before execution, these are swapped into the worker's current (a macro that returns the task_struct of the currently running thread).

static void io_worker_handle_work(struct io_worker *worker)
	__releases(wqe->lock)
{
	struct io_wq_work *work, *old_work = NULL, *put_work = NULL;
	struct io_wqe *wqe = worker->wqe;
	struct io_wq *wq = wqe->wq;

	do {
		......
		if (work->files && current->files != work->files) {
			task_lock(current);
			current->files = work->files;
			task_unlock(current);
		}
		if (work->fs && current->fs != work->fs)
			current->fs = work->fs;
		if (work->mm != worker->mm)
			io_wq_switch_mm(worker, work);
		if (worker->cur_creds != work->creds)
			io_wq_switch_creds(worker, work);
		......
		work->func(&work);
		......
	} while (1);
}

https://elixir.bootlin.com/linux/v5.6.19/source/fs/io-wq.c#L443

