nsenter


Author: haozi007  Date: 2020-02-15


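Analysis of the nsenter module

The nsenter module mainly handles namespace management (joining the current process to a given namespace, or creating a new one), uid and gid mapping, and console handling.

It is implemented in two languages, Go and C. The code lives in:

libcontainer/nsenter, with the core implementation in libcontainer/nsenter/nsexec.c.

Module entry point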

package nsenter

/*
#cgo CFLAGS: -Wall
extern void nsexec();
void __attribute__((constructor)) init(void) {
	nsexec();
}
*/
import "C"

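When a package contains import _ "github.com/opencontainers/runc/libcontainer/nsenter", the C part of this package is compiled into the resulting executable. The C code defines a constructor function, init(void). A C constructor (declared with __attribute__((constructor))) runs before main, so init(void) runs as soon as the executable starts, before the Go runtime is up. As a result, nsexec() is the first thing to execute.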
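To make the constructor behaviour concrete, here is a minimal C sketch. It is not taken from runc: the names constructor_ran, early_init and constructor_has_run are hypothetical and exist only for this illustration. It assumes a GCC- or Clang-compatible compiler, since the constructor attribute is a compiler extension.

```c
/* Hypothetical flag, used only to observe that the constructor ran. */
static int constructor_ran = 0;

/*
 * GCC and Clang call functions marked with the constructor attribute
 * before main() is entered. runc's nsenter package relies on the same
 * mechanism to call nsexec() early.
 */
__attribute__((constructor)) static void early_init(void)
{
	constructor_ran = 1;
}

/* Returns 1 if early_init() has already run, 0 otherwise. */
int constructor_has_run(void)
{
	return constructor_ran;
}
```

Calling constructor_has_run() from main() returns 1, although main() never calls early_init() itself.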

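The nsexec function

Its main tasks are:

  1. Set up the log pipe, which is used to forward logs;
  2. Set up the init pipe, which is used to receive configuration data such as namespaces and to send the child PIDs back;
  3. ensure clone binary, which mitigates CVE-2019-5736 by preventing attacks that go through /proc/self/exe;
  4. Read and parse the namespace and other configuration data passed in over the init pipe;
  5. Update the OOM configuration;
  6. Perform the double fork.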
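The first two steps depend on pipe file descriptors that the parent process opened before starting this one. As a rough sketch of how such a descriptor number could be picked up, the helper below parses it from an environment variable. The helper name fd_from_env and its return codes are hypothetical. The variable name used in the usage note, _LIBCONTAINER_INITPIPE, is what runc uses as far as I know, but treat it as an assumption and check nsexec.c for the exact names.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a file descriptor number from the environment variable `name`.
 * Returns -1 if the variable is unset or empty, and -2 if its value is
 * not a valid non-negative integer. (Hypothetical helper, not runc code.)
 */
int fd_from_env(const char *name)
{
	const char *val = getenv(name);
	char *end = NULL;
	long fd;

	if (val == NULL || *val == '\0')
		return -1;

	errno = 0;
	fd = strtol(val, &end, 10);
	if (errno != 0 || *end != '\0' || fd < 0 || fd > INT_MAX)
		return -2;
	return (int)fd;
}
```

For example, fd_from_env("_LIBCONTAINER_INITPIPE") yields the descriptor number when the variable holds one. A result of -1 would mean the process was not started as a container init, in which case a constructor like this can simply return and let the program start normally.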

ensure clone binary

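On the first run, the contents of the original binary are copied into memory. Every later execution uses this in-memory copy, which eliminates the vulnerability caused by the binary being modified while it is running.

The detailed implementation is still to be analyzed: cloned_binary.c, ensure_cloned_binary()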

double clone

nsexec clones the process twice.

The reason two clones are needed is explained in the following source comment:

/*
* Okay, so this is quite annoying.
*
* In order for this unsharing code to be more extensible we need to split
* up unshare(CLONE_NEWUSER) and clone() in various ways. The ideal case
* would be if we did clone(CLONE_NEWUSER) and the other namespaces
* separately, but because of SELinux issues we cannot really do that. But
* we cannot just dump the namespace flags into clone(...) because several
* usecases (such as rootless containers) require more granularity around
* the namespace setup. In addition, some older kernels had issues where
* CLONE_NEWUSER wasn't handled before other namespaces (but we cannot
* handle this while also dealing with SELinux so we choose SELinux support
* over broken kernel support).
*
* However, if we unshare(2) the user namespace *before* we clone(2), then
* all hell breaks loose.
*
* The parent no longer has permissions to do many things (unshare(2) drops
* all capabilities in your old namespace), and the container cannot be set
* up to have more than one {uid,gid} mapping. This is obviously less than
* ideal. In order to fix this, we have to first clone(2) and then unshare.
*
* Unfortunately, it's not as simple as that. We have to fork to enter the
* PID namespace (the PID namespace only applies to children). Since we'll
* have to double-fork, this clone_parent() call won't be able to get the
* PID of the _actual_ init process (without doing more synchronisation than
* I can deal with at the moment). So we'll just get the parent to send it
* for us, the only job of this process is to update
* /proc/pid/{setgroups,uid_map,gid_map}.
*
* And as a result of the above, we also need to setns(2) in the first child
* because if we join a PID namespace in the topmost parent then our child
* will be in that namespace (and it will not be able to give us a PID value
* that makes sense without resorting to sending things with cmsg).
*
* This also deals with an older issue caused by dumping cloneflags into
* clone(2): On old kernels, CLONE_PARENT didn't work with CLONE_NEWPID, so
* we have to unshare(2) before clone(2) in order to do this. This was fixed
* in upstream commit 1f7f4dde5c945f41a7abc2285be43d918029ecc5, and was
* introduced by 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e. As far as we're
* aware, the last mainline kernel which had this bug was Linux 3.12.
* However, we cannot comment on which kernels the broken patch was
* backported to.
*
* -- Aleksa "what has my life come to?" Sarai
*/

Including the parent, three processes are involved. Their interaction sequence is as follows:

Title: How to clone init process
Parent->Child: clone first child
Note right of Child: join namespace and unshare newuser
Child->Parent: send SYNC_USERMAP_PLS
Note left of Parent: update setgroups, uid_map and gid_map
Parent->Child: send SYNC_USERMAP_ACK
Note right of Child: unshare other namespaces, except cgroup
Child->GrandChild: clone grand child
Child->Parent: send SYNC_RECVPID_PLS
Note left of Parent: get PIDs of children
Parent->Child: send SYNC_RECVPID_ACK
Note left of Parent: send PIDs of children to own parent (the runc create process)
Child->Parent: send SYNC_CHILD_READY
Note right of Child: finish
Parent->GrandChild: send SYNC_GRANDCHILD
Note left of Parent: wait for SYNC_CHILD_READY from GrandChild
Note right of GrandChild: set sid, uid, gid
Note right of GrandChild: unshare cgroup namespace
GrandChild->Parent: send SYNC_CHILD_READY
Note left of Parent: finish
Note right of GrandChild: let the Go runtime take over the process
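The Parent/Child half of this sequence can be sketched in plain C. This is a simplified illustration under several assumptions, not the code in nsexec.c: it uses fork() instead of clone() with CLONE_PARENT, performs no setns(2) or unshare(2) calls (so it needs no privileges), and leaves out the Parent/GrandChild channel and SYNC_GRANDCHILD. The numeric message values and the helper names (send_msg, expect_msg, child_stage, run_handshake) are hypothetical; only the message names come from the diagram.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Message names follow the diagram; the numeric values are illustrative. */
enum {
	SYNC_USERMAP_PLS = 0x40,
	SYNC_USERMAP_ACK = 0x41,
	SYNC_RECVPID_PLS = 0x42,
	SYNC_RECVPID_ACK = 0x43,
	SYNC_CHILD_READY = 0x45,
};

static int send_msg(int fd, int msg)
{
	return write(fd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg) ? 0 : -1;
}

static int expect_msg(int fd, int want)
{
	int got = 0;

	if (read(fd, &got, sizeof(got)) != (ssize_t)sizeof(got))
		return -1;
	return got == want ? 0 : -1;
}

/* First child: asks for the id mappings, forks the grandchild, reports its PID. */
static void child_stage(int fd)
{
	pid_t grandchild;

	/* The real code joins namespaces and unshares the user namespace here. */
	if (send_msg(fd, SYNC_USERMAP_PLS) < 0 || expect_msg(fd, SYNC_USERMAP_ACK) < 0)
		_exit(1);

	/* The real code unshares the remaining namespaces before cloning. */
	grandchild = fork();
	if (grandchild < 0)
		_exit(1);
	if (grandchild == 0)
		_exit(0); /* stands in for the future container init */

	if (send_msg(fd, SYNC_RECVPID_PLS) < 0 ||
	    write(fd, &grandchild, sizeof(grandchild)) != (ssize_t)sizeof(grandchild) ||
	    expect_msg(fd, SYNC_RECVPID_ACK) < 0)
		_exit(1);

	/* Needed only because this sketch uses fork() rather than CLONE_PARENT. */
	waitpid(grandchild, NULL, 0);

	if (send_msg(fd, SYNC_CHILD_READY) < 0)
		_exit(1);
	_exit(0);
}

/* Parent side: returns the grandchild PID (> 0) on success, -1 on failure. */
pid_t run_handshake(void)
{
	int sv[2], status, ok;
	pid_t child, grandchild = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	child = fork();
	if (child < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}
	if (child == 0) {
		close(sv[0]);
		child_stage(sv[1]); /* never returns */
	}
	close(sv[1]);

	/* The real parent writes setgroups, uid_map and gid_map before the first ACK. */
	ok = expect_msg(sv[0], SYNC_USERMAP_PLS) == 0 &&
	     send_msg(sv[0], SYNC_USERMAP_ACK) == 0 &&
	     expect_msg(sv[0], SYNC_RECVPID_PLS) == 0 &&
	     read(sv[0], &grandchild, sizeof(grandchild)) == (ssize_t)sizeof(grandchild) &&
	     send_msg(sv[0], SYNC_RECVPID_ACK) == 0 &&
	     expect_msg(sv[0], SYNC_CHILD_READY) == 0;

	close(sv[0]);
	if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;
	return ok ? grandchild : -1;
}
```

Every message is acknowledged before the next step runs, so neither side can move ahead of the other. If one side dies midway, the other side's read or write fails and run_handshake() reports -1 instead of blocking.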