Author: haozi007 Date: 2020-02-15

The Slick Container Creation Flow

The runc create flow contains quite a few slick tricks. We first lay out the overall flow, then explore the details step by step.

graph LR
main.go --> createCommand
createCommand --> revisePidFile
createCommand --> setupSpec
createCommand --> startContainer
createCommand --> Exit
setupSpec --> loadSpec
startContainer --> newNotifySocket
startContainer --> createContainer
startContainer --> setupSocket
startContainer --> runner.run.CT_ACT_CREATE

createContainer

createContainer creates the libcontainer.Container struct and sets the container's configuration. The main steps are:

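Convert the OCI spec into configs.Config, the configuration struct that libcontainer understands
In loadFactory, initialize the cgroup manager

In loadFactory, create the LinuxFactory. Note the values of InitPath and InitArgs: they are the key to how runc create launches the runc init process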

LinuxFactory{
		Root:      root,
		InitPath:  "/proc/self/exe",
		InitArgs:  []string{os.Args[0], "init"},
		Validator: validate.New(),
		CriuPath:  "criu",
	}

Create the linuxContainer

linuxContainer{
		id:            id,
		root:          containerRoot,
		config:        config,
		initPath:      l.InitPath,
		initArgs:      l.InitArgs,
		criuPath:      l.CriuPath,
		newuidmapPath: l.NewuidmapPath,
		newgidmapPath: l.NewgidmapPath,
		cgroupManager: l.NewCgroupsManager(config.Cgroups, nil),
	}

Note: the created container's initPath is "/proc/self/exe" and its initArgs are os.Args[0] followed by "init". Their role becomes clear later in the flow.

runner.run

First, look at the runner struct in utils_linux.go.

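type runner struct {
	// whether the process being started is the container's PID 1
	init            bool
	// whether the current process acts as the subreaper for its descendants (a role equivalent to PID 1)
	enableSubreaper bool
	// whether cleanup is required: deleting cgroups, running poststop hooks, and so on
	shouldDestroy   bool
	// whether the container runs in detached mode
	detach          bool
	listenFDs       []*os.File
	preserveFDs     int
	// path of the pid file
	pidFile         string
	// path of the AF_UNIX socket used to receive the master end of the console pseudoterminal
	consoleSocket   string
	// the container struct created in the previous step
	container       libcontainer.Container
	// the action type of the runner
	action          CtAct
	// the socket file used for notify
	notifySocket    *notifySocket
	// CRIU-related options
	criuOpts        *libcontainer.CriuOpts
	// log level
	logLevel        string
}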

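// run performs the runner's action; CREATE, RESTORE and RUN are supported
func (r *runner) run(config *specs.Process) (int, error)

// destroy cleans up the container, acting according to its state
func (r *runner) destroy()

// terminate stops the container process
func (r *runner) terminate(p *libcontainer.Process)

// checkTerminal checks that the terminal, console and detach settings are consistent
func (r *runner) checkTerminal(config *specs.Process) error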

The run() function mainly prepares a libcontainer.Process that is handed to the linuxContainer.Start flow.

The overall flow is as follows:

graph LR
run-->prepare
prepare-->checkTerminal
prepare-->newProcess
prepare-->append-ExtraFiles
prepare-->set-uid-gid
prepare-->newSignalHandler
prepare-->setupIO
prepare-->container.Start

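Preparing the process

This mainly covers the following:

newProcess creates the struct and copies the container's process configuration into it;
the extra fds are appended to the struct's ExtraFiles, and the LISTEN_FDS environment variable is set;
the uid and gid are set;
the signal handler is initialized;
the io is set up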

container-Start

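Step one is to create the execFIFO. This FIFO file controls when the container's first process gets executed. Before exec-ing the container's first process, runc init writes a "0" byte into this FIFO; as long as nobody has opened the FIFO for reading, runc init stays blocked (a write-only open of a FIFO does not return until a reader appears). The runc start command is therefore very simple: it only has to open this FIFO.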
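The blocking behaviour described above can be reproduced inside a single process. The following is a minimal sketch, not runc's code: the function names, the temporary directory, and the file name exec.fifo are assumptions made for illustration, and it only runs on Linux or another Unix with FIFOs.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// waitForStart plays the role of the init side: opening the FIFO write-only
// blocks until a reader opens it, then a single "0" byte is written.
func waitForStart(fifoPath string) error {
	f, err := os.OpenFile(fifoPath, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write([]byte("0"))
	return err
}

// signalStart plays the role of the start side: opening the FIFO read-only
// releases the blocked writer; the bytes it wrote are returned.
func signalStart(fifoPath string) (string, error) {
	f, err := os.OpenFile(fifoPath, os.O_RDONLY, 0)
	if err != nil {
		return "", err
	}
	defer f.Close()
	data, err := io.ReadAll(f)
	return string(data), err
}

// runFifoDemo wires both sides together, using a goroutine for the writer.
func runFifoDemo() (string, error) {
	dir, err := os.MkdirTemp("", "execfifo-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	fifoPath := filepath.Join(dir, "exec.fifo") // hypothetical name
	if err := syscall.Mkfifo(fifoPath, 0622); err != nil {
		return "", err
	}

	writerErr := make(chan error, 1)
	go func() { writerErr <- waitForStart(fifoPath) }()

	data, err := signalStart(fifoPath)
	if err != nil {
		return "", err
	}
	if err := <-writerErr; err != nil {
		return "", err
	}
	return data, nil
}

func main() {
	data, err := runFifoDemo()
	if err != nil {
		fmt.Fprintln(os.Stderr, "demo failed:", err)
		os.Exit(1)
	}
	fmt.Printf("start side read %q from the FIFO\n", data)
}
```

The writer goroutine cannot make progress until signalStart opens the FIFO, which is the same rendezvous that lets a separate start command decide when the container's first process runs.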

The newParentProcess function

This is the most critical step: creating the process that launches the container.

graph LR
newParentProcess-->commandTemplate
newParentProcess-->includeExecFifo
newParentProcess-->newInitProcess

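The commandTemplate function prepares the exec.Cmd struct for the process to run. A few of the more important settings:

// Remember the point highlighted earlier? Here initPath is "/proc/self/exe" and initArgs[1] is "init"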
cmd := exec.Command(c.initPath, c.initArgs[1:]...)

// Pass the INITPIPE descriptor through an environment variable; the nsenter module uses it
cmd.Env = append(cmd.Env,
		fmt.Sprintf("_LIBCONTAINER_INITPIPE=%d", stdioFdCount+len(cmd.ExtraFiles)-1),
		fmt.Sprintf("_LIBCONTAINER_STATEDIR=%s", c.root),
	)

// Pass the LOGPIPE descriptor through an environment variable; the nsenter module uses it
cmd.Env = append(cmd.Env,
		fmt.Sprintf("_LIBCONTAINER_LOGPIPE=%d", stdioFdCount+len(cmd.ExtraFiles)-1),
		fmt.Sprintf("_LIBCONTAINER_LOGLEVEL=%s", p.LogLevel),
	)

The includeExecFifo function passes the execFIFO descriptor through an environment variable:

cmd.Env = append(cmd.Env, fmt.Sprintf("_LIBCONTAINER_FIFOFD=%d", stdioFdCount+len(cmd.ExtraFiles)-1))
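The expression stdioFdCount+len(cmd.ExtraFiles)-1 works because os/exec hands ExtraFiles to the child starting at descriptor 3, so the most recently appended file always gets that number. A minimal sketch of the pattern follows; passFile, demoEnv and the _DEMO_* variable names are hypothetical and not part of runc.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// stdioFdCount is the number of descriptors (stdin, stdout, stderr) that
// precede cmd.ExtraFiles in the child process.
const stdioFdCount = 3

// passFile appends f to cmd.ExtraFiles and records, in the environment
// variable envName, the descriptor number f will have in the child.
func passFile(cmd *exec.Cmd, envName string, f *os.File) {
	cmd.ExtraFiles = append(cmd.ExtraFiles, f)
	fd := stdioFdCount + len(cmd.ExtraFiles) - 1
	cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%d", envName, fd))
}

// demoEnv passes two files and returns the resulting environment entries.
func demoEnv() []string {
	cmd := exec.Command("/bin/true") // never started; only its fields are used
	passFile(cmd, "_DEMO_INITPIPE", os.Stdin)
	passFile(cmd, "_DEMO_FIFOFD", os.Stdout)
	return cmd.Env
}

func main() {
	for _, kv := range demoEnv() {
		fmt.Println(kv)
	}
}
```

Running it prints _DEMO_INITPIPE=3 and _DEMO_FIFOFD=4: each variable must be computed right after its file is appended, otherwise the index would point at a later file.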

The newInitProcess function sets the init type, sets up the bootstrap data (the data used by the nsenter module), and creates the initProcess struct:

initProcess{
		cmd:             cmd,
		messageSockPair: messageSockPair,
		logFilePair:     logFilePair,
		manager:         c.cgroupManager,
		intelRdtManager: c.intelRdtManager,
		config:          c.newInitConfig(p),
		container:       c,
		process:         p,
		bootstrapData:   data,
		sharePidns:      sharePidns,
	}

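Starting the initProcess

The first step is to start the Cmd returned by commandTemplate, that is, to start a new process via exec. Its binary is "/proc/self/exe", which refers to the binary of the current process, namely runc, and its first argument is init. This is therefore equivalent to running "runc init".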
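The self re-exec trick can be tried in isolation. The sketch below is a stand-alone Linux program, not runc itself: initCommand and initArgv are hypothetical helpers that only mirror the exec.Command(c.initPath, c.initArgs[1:]...) call quoted earlier.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// initCommand mirrors the pattern from commandTemplate: the binary is
// initPath and the arguments are everything after initArgs[0].
func initCommand(initPath string, initArgs []string) *exec.Cmd {
	return exec.Command(initPath, initArgs[1:]...)
}

// initArgv returns the argument vector built for the values used in this article.
func initArgv() []string {
	return initCommand("/proc/self/exe", []string{os.Args[0], "init"}).Args
}

func main() {
	// Child branch: we were started again with "init" as the first argument.
	if len(os.Args) > 1 && os.Args[1] == "init" {
		fmt.Println("child: started through /proc/self/exe with argument init")
		return
	}

	// Parent branch: re-execute the current binary.
	cmd := initCommand("/proc/self/exe", []string{os.Args[0], "init"})
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "re-exec failed:", err)
		os.Exit(1)
	}
}
```

Because /proc/self/exe always resolves to the running executable, the parent does not need to know where its own binary is installed.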

The interaction between the two processes now looks like this:

Title: runc create start init process
create->init: start new process
Note left of create: apply cgroup sets to init.pid
create->init: send bootstrap data to init
init->init: get bootstrap data, and do some works
create->create: wait child pids, and wait first child finish
init->create: send child and grand child pids
Note left of create: apply cgroup sets to child.pid
create->init: send sync message -- creatCgroupns
create->create: wait grand child finish
Note left of create: create network interface
create->init: send config data
create->create: wait sync message
Note right of init: ... now is runc init go codes...
init->init: get config data, do many works...
init->create: send sync message -- procReady
create->create: 1. set cgroup sets;2. run preStart hooks
create->init: send sync message -- procRun
Note left of create: another sync message is procHooks
create->create: wait init pipe closed
Note left of create: 1. update state of container, 2. run postStart hooks
Note left of create: finish
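The procReady / procRun exchange in the diagram is a simple request-and-acknowledge handshake. The following sketch models it with two goroutines and an in-memory connection; the syncMsg struct, its JSON encoding, and all function names are assumptions for illustration, not runc's actual wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"os"
)

// syncMsg is a hypothetical message shape used only in this sketch.
type syncMsg struct {
	Type string `json:"type"`
}

// send writes one sync message of the given type.
func send(enc *json.Encoder, t string) error {
	return enc.Encode(syncMsg{Type: t})
}

// expect reads one sync message and checks its type.
func expect(dec *json.Decoder, want string) error {
	var m syncMsg
	if err := dec.Decode(&m); err != nil {
		return err
	}
	if m.Type != want {
		return fmt.Errorf("got sync message %q, want %q", m.Type, want)
	}
	return nil
}

// initSide reports procReady and then waits for procRun.
func initSide(conn net.Conn) error {
	defer conn.Close()
	if err := send(json.NewEncoder(conn), "procReady"); err != nil {
		return err
	}
	return expect(json.NewDecoder(conn), "procRun")
}

// createSide waits for procReady, does its parent-side work, then sends procRun.
func createSide(conn net.Conn) ([]string, error) {
	defer conn.Close()
	if err := expect(json.NewDecoder(conn), "procReady"); err != nil {
		return nil, err
	}
	events := []string{"received procReady"}
	// Parent-side work (cgroup settings, preStart hooks) would happen here.
	if err := send(json.NewEncoder(conn), "procRun"); err != nil {
		return nil, err
	}
	return append(events, "sent procRun"), nil
}

// runHandshake connects both sides and returns the parent's event log.
func runHandshake() ([]string, error) {
	createConn, initConn := net.Pipe()
	initErr := make(chan error, 1)
	go func() { initErr <- initSide(initConn) }()

	events, err := createSide(createConn)
	if err != nil {
		return nil, err
	}
	if err := <-initErr; err != nil {
		return nil, err
	}
	return events, nil
}

func main() {
	events, err := runHandshake()
	if err != nil {
		fmt.Fprintln(os.Stderr, "handshake failed:", err)
		os.Exit(1)
	}
	for _, e := range events {
		fmt.Println(e)
	}
}
```

The init side cannot continue past procReady until the create side answers, which gives the parent a window to apply settings that must be in place before the container process goes on.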

runc init

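The work of the init process falls into two parts:

Part one, in nsenter: perform the double fork, set up the namespaces, and related operations;
Part two, in the Go init code: this will be analyzed in detail later.

Everything from "send config data" onward belongs to part two.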

graph LR
Init --> configure-network
Init --> prepareRootfs
Init --> CreateConsole
Init --> finalizeRootfs
Init --> ApplyProfile
Init --> Readonly-And-Mask-Paths
Init --> syncParentReady
Init --> SetProcessLabel
Init --> InitSeccomp
Init --> finalizeNamespace
Init --> close-pipe-to-notify-init-complete
Init --> open-and-write-exec-fifo-to-wait-runc-start
Init --> exec-container-init-process

Configuring the network

This involves two parts:

Setting up the loopback network
Setting up routing information

prepareRootfs

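For the concept of mount propagation types, see the referenced article.

A peer group is a set of one or more mount points that share mount information with each other.
Currently, two mount points end up in the same peer group in the following two cases (provided that the mount point's propagation type is shared):

The mount --bind command puts the source and destination mount points into the same peer group, provided that the source is itself a mount point.

When a new mount namespace is created, it receives a copy of the old namespace's mount points, so the same mount point in the new and the old namespace belongs to the same peer group.

Every mount point has a propagation type flag. It determines whether creating or removing a mount point beneath it propagates to the other mount points in the same peer group, that is, whether the corresponding mount points are also created or removed beneath those peers. There are four propagation types:

MS_SHARED: As the name suggests, mount information is shared among the mount points of the same peer group. When a mount point is added or removed beneath one of them, the same mount point is mounted or unmounted beneath the other members of the peer group.

MS_PRIVATE: The exact opposite: mount information is not shared at all, so a private mount point does not belong to any peer group.

MS_SLAVE: As the name suggests, propagation is one-way. Within the same peer group, when something changes beneath the master mount point, the slave mount point follows, but not the other way round: changes beneath the slave are not reported to the master, and the master does not change.

MS_UNBINDABLE: The same as MS_PRIVATE, except that such a mount point cannot be used as the source of a bind mount; it mainly prevents recursive nesting. This type is uncommon and is not covered in this article.

Some additional notes:

The propagation type is an attribute of a mount point, and each mount point is independent.

Mount points have parent-child relationships. For example, given the mount points / and /mnt/cdrom, /mnt/cdrom is a child mount point of "/", and "/" is the parent mount point of /mnt/cdrom.

By default, if the parent mount point is MS_SHARED, the child mount point is also MS_SHARED; otherwise the child is MS_PRIVATE, regardless of the grandparent mount point.

Therefore, runc first changes the propagation type of the root directory in the container's mount namespace: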

func prepareRoot(config *configs.Config) error {
	flag := unix.MS_SLAVE | unix.MS_REC
	if config.RootPropagation != 0 {
		flag = config.RootPropagation
	}
	if err := unix.Mount("", "/", "", uintptr(flag), ""); err != nil {
		return err
	}

	// Make parent mount private to make sure following bind mount does
	// not propagate in other namespaces. Also it will help with kernel
	// check pass in pivot_root. (IS_SHARED(new_mnt->mnt_parent))
	if err := rootfsParentMountPrivate(config.Rootfs); err != nil {
		return err
	}

	return unix.Mount(config.Rootfs, config.Rootfs, "bind", unix.MS_BIND|unix.MS_REC, "")
}
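In prepareRoot, RootPropagation is an integer set of mount flags, and MS_SLAVE|MS_REC is used when it is zero. As a hedged illustration of how a textual propagation mode could be turned into such flags, here is a small sketch; the mode names and the propagationFlags function are assumptions made for this example and are not taken from the code above.

```go
package main

import (
	"fmt"
	"syscall"
)

// propagationFlags maps a textual propagation mode to mount(2) flags.
// An empty mode falls back to recursive slave, matching the default
// used by prepareRoot when RootPropagation is 0.
func propagationFlags(mode string) (int, error) {
	switch mode {
	case "", "rslave":
		return syscall.MS_SLAVE | syscall.MS_REC, nil
	case "slave":
		return syscall.MS_SLAVE, nil
	case "rprivate":
		return syscall.MS_PRIVATE | syscall.MS_REC, nil
	case "private":
		return syscall.MS_PRIVATE, nil
	case "rshared":
		return syscall.MS_SHARED | syscall.MS_REC, nil
	case "shared":
		return syscall.MS_SHARED, nil
	case "runbindable":
		return syscall.MS_UNBINDABLE | syscall.MS_REC, nil
	case "unbindable":
		return syscall.MS_UNBINDABLE, nil
	}
	return 0, fmt.Errorf("unknown propagation mode %q", mode)
}

func main() {
	for _, mode := range []string{"", "shared", "rprivate"} {
		flags, err := propagationFlags(mode)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-10q -> %#x\n", mode, flags)
	}
}
```

The MS_REC bit is what makes the change apply to the whole mount subtree instead of only the mount point itself.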

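It then changes the propagation type of the rootfs's parent mount point, for two reasons: first, to keep pivot_root from failing; second, to keep the bind mount of the rootfs from propagating to the parent mount namespace.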

// Make parent mount private if it was shared
func rootfsParentMountPrivate(rootfs string) error {
	sharedMount := false

	parentMount, optionalOpts, err := getParentMount(rootfs)
	if err != nil {
		return err
	}

	optsSplit := strings.Split(optionalOpts, " ")
	for _, opt := range optsSplit {
		if strings.HasPrefix(opt, "shared:") {
			sharedMount = true
			break
		}
	}

	// Make parent mount PRIVATE if it was shared. It is needed for two
	// reasons. First of all pivot_root() will fail if parent mount is
	// shared. Secondly when we bind mount rootfs it will propagate to
	// parent namespace and we don't want that to happen.
	if sharedMount {
		return unix.Mount("", parentMount, "", unix.MS_PRIVATE, "")
	}

	return nil
}
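The "shared:" check above reads the optional-fields column of /proc/self/mountinfo, where the kernel tags a mount with shared:N when it is shared, master:N when it is a slave, and unbindable when it cannot be bind-mounted; a mount with none of these tags is private. A minimal sketch of classifying that column follows; propagationOf and its return strings are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// propagationOf classifies the optional-fields column of one
// /proc/self/mountinfo line.
func propagationOf(optionalFields string) string {
	shared, slave, unbindable := false, false, false
	for _, opt := range strings.Fields(optionalFields) {
		switch {
		case strings.HasPrefix(opt, "shared:"):
			shared = true
		case strings.HasPrefix(opt, "master:"):
			slave = true
		case opt == "unbindable":
			unbindable = true
		}
	}
	switch {
	case shared && slave:
		return "shared+slave"
	case shared:
		return "shared"
	case slave:
		return "slave"
	case unbindable:
		return "unbindable"
	default:
		return "private"
	}
}

func main() {
	for _, fields := range []string{"shared:1", "master:2", "shared:3 master:2", "unbindable", ""} {
		fmt.Printf("%-20q -> %s\n", fields, propagationOf(fields))
	}
}
```

A mount can be a slave of one peer group and shared in another at the same time, which is why the two tags are tracked independently.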

Writing sysctl settings

The values set in configs.Config.Sysctl are written to the corresponding entries under /proc/sys, for example ip_forward:

/proc/sys/net/ipv4/ip_forward
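Assuming the keys use the usual dotted sysctl form (net.ipv4.ip_forward), the mapping to a file under /proc/sys is a plain string substitution. The sketch below illustrates that idea; sysctlPath and writeSysctl are hypothetical helper names, and writeSysctl is not called here because writing to /proc/sys normally requires privileges.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// sysctlPath converts a dotted sysctl key into its path under /proc/sys.
func sysctlPath(key string) string {
	return filepath.Join("/proc/sys", strings.ReplaceAll(key, ".", "/"))
}

// writeSysctl writes value to the /proc/sys entry for key.
func writeSysctl(key, value string) error {
	return os.WriteFile(sysctlPath(key), []byte(value), 0644)
}

func main() {
	fmt.Println(sysctlPath("net.ipv4.ip_forward"))
}
```

Running it prints /proc/sys/net/ipv4/ip_forward, the path shown above.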

Setting read-only paths

The paths set in configs.Config.ReadonlyPaths are remounted read-only:

// readonlyPath will make a path read only.
func readonlyPath(path string) error {
    if err := unix.Mount(path, path, "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
        if os.IsNotExist(err) {
            return nil
        }
        return err
    }
    return unix.Mount(path, path, "", unix.MS_BIND|unix.MS_REMOUNT|unix.MS_RDONLY|unix.MS_REC, "")
}

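Masking paths

The paths set in configs.Config.MaskPaths are masked: a file is covered by bind-mounting /dev/null over it, and a directory by mounting a read-only tmpfs over it: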

// maskPath masks the top of the specified path inside a container to avoid
// security issues from processes reading information from non-namespace aware
// mounts ( proc/kcore ).
// For files, maskPath bind mounts /dev/null over the top of the specified path.
// For directories, maskPath mounts read-only tmpfs over the top of the specified path.
func maskPath(path string, mountLabel string) error {
    if err := unix.Mount("/dev/null", path, "", unix.MS_BIND, ""); err != nil && !os.IsNotExist(err) {
        if err == unix.ENOTDIR {
            return unix.Mount("tmpfs", path, "tmpfs", unix.MS_RDONLY, label.FormatMountLabel("", mountLabel))
        }    
        return err
    }
    return nil
}
