Author: 耗子007
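Secure computing mode (Seccomp) is a Linux kernel feature that can be used to restrict the actions available inside a container.
The feature is available only if:
- Docker was built with seccomp support
- the kernel is configured with CONFIG_SECCOMP enabled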
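One way to check the kernel prerequisite is to look for `CONFIG_SECCOMP=y` in the kernel build configuration. The following is a minimal sketch, not a definitive check: the helper name `kernel_has_seccomp` is hypothetical, and the path `/boot/config-<release>` is an assumption that holds on many distributions but not all of them.

```python
import os
import platform


def kernel_has_seccomp(config_text: str) -> bool:
    """Return True if the kernel config text enables CONFIG_SECCOMP."""
    for line in config_text.splitlines():
        # Enabled options look like "CONFIG_SECCOMP=y"; disabled ones are
        # written as comments ("# CONFIG_SECCOMP is not set").
        if line.strip() == "CONFIG_SECCOMP=y":
            return True
    return False


if __name__ == "__main__":
    # Assumed location of the config file for the running kernel.
    path = f"/boot/config-{platform.release()}"
    if os.path.exists(path):
        with open(path) as f:
            print("CONFIG_SECCOMP enabled:", kernel_has_seccomp(f.read()))
    else:
        print(f"{path} not found; check your distribution's kernel config")
```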
Modifying the default Seccomp profile
The default Seccomp profile disables 44 system calls.
You can use the default profile as a starting point, customize it, and pass the result to docker run with --security-opt, for example:
$ docker run --rm -it --security-opt seccomp=/path/to/seccomp/profile.json hello-world
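A custom profile is a JSON document. As a rough sketch of how one might be generated, the Python below builds a tiny allowlist profile and writes it to a file. The field names (`defaultAction`, `syscalls`, `names`, `action`) follow the layout used by recent Docker releases; older releases used a singular `name` field, so compare against the default profile shipped with your version. The helper name `build_profile`, the syscall list, and the output file name are illustrative assumptions, and a list this short is far too small to run a real container.

```python
import json


def build_profile(allowed_syscalls):
    """Build a seccomp profile dict that allows only the given syscalls."""
    return {
        # Any syscall not matched by a rule below fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


if __name__ == "__main__":
    # Illustrative list only; a real workload needs many more syscalls.
    profile = build_profile(["read", "write", "exit", "exit_group"])
    with open("profile.json", "w") as f:
        json.dump(profile, f, indent=2)
```

Starting from a copy of the default profile and removing or adding entries is usually safer than building a list from scratch.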
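The default Seccomp profile is a whitelist: any system call that is not listed is denied. The table below lists some (not all) of the blocked system calls and the reason each one is blocked.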
| Syscall | Description |
|---|---|
| acct | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT. |
| add_key | Prevent containers from using the kernel keyring, which is not namespaced. |
| adjtimex | Similar to clock_settime and settimeofday, time/date is not namespaced. Also gated by CAP_SYS_TIME. |
| bpf | Deny loading potentially persistent bpf programs into kernel, already gated by CAP_SYS_ADMIN. |
| clock_adjtime | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
| clock_settime | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
| clone | Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_NEWUSER. |
| create_module | Deny manipulation and functions on kernel modules. Obsolete. Also gated by CAP_SYS_MODULE. |
| delete_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
| finit_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
| get_kernel_syms | Deny retrieval of exported kernel and module symbols. Obsolete. |
| get_mempolicy | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
| init_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
| ioperm | Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO. |
| iopl | Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO. |
| kcmp | Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE. |
| kexec_file_load | Sister syscall of kexec_load that does the same thing, slightly different arguments. Also gated by CAP_SYS_BOOT. |
| kexec_load | Deny loading a new kernel for later execution. Also gated by CAP_SYS_BOOT. |
| keyctl | Prevent containers from using the kernel keyring, which is not namespaced. |
| lookup_dcookie | Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by CAP_SYS_ADMIN. |
| mbind | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
| mount | Deny mounting, already gated by CAP_SYS_ADMIN. |
| move_pages | Syscall that modifies kernel memory and NUMA settings. |
| name_to_handle_at | Sister syscall to open_by_handle_at. Already gated by CAP_DAC_READ_SEARCH. |
| nfsservctl | Deny interaction with the kernel nfs daemon. Obsolete since Linux 3.1. |
| open_by_handle_at | Cause of an old container breakout. Also gated by CAP_DAC_READ_SEARCH. |
| perf_event_open | Tracing/profiling syscall, which could leak a lot of information on the host. |
| personality | Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns. |
| pivot_root | Deny pivot_root, should be privileged operation. |
| process_vm_readv | Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE. |
| process_vm_writev | Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE. |
| ptrace | Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping CAP_SYS_PTRACE. |
| query_module | Deny manipulation and functions on kernel modules. Obsolete. |
| quotactl | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_ADMIN. |
| reboot | Don’t let containers reboot the host. Also gated by CAP_SYS_BOOT. |
| request_key | Prevent containers from using the kernel keyring, which is not namespaced. |
| set_mempolicy | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
| setns | Deny associating a thread with a namespace. Also gated by CAP_SYS_ADMIN. |
| settimeofday | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
| stime | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
| swapon | Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN. |
| swapoff | Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN. |
| sysfs | Obsolete syscall. |
| _sysctl | Obsolete, replaced by /proc/sys. |
| umount | Should be a privileged operation. Also gated by CAP_SYS_ADMIN. |
| umount2 | Should be a privileged operation. Also gated by CAP_SYS_ADMIN. |
| unshare | Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN, with the exception of unshare --user. |
| uselib | Older syscall related to shared libraries, unused for a long time. |
| userfaultfd | Userspace page fault handling, largely needed for process migration. |
| ustat | Obsolete syscall. |
| vm86 | In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN. |
| vm86old | In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN. |
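To make the whitelist behaviour concrete, here is a simplified model of how a profile decides what happens to a syscall. It assumes a profile dict with a `defaultAction` and a list of `syscalls` rules, each carrying `names` and an `action`; it ignores argument filters and architecture handling, which real profiles support. The function name `action_for` and the sample profile are hypothetical.

```python
def action_for(profile, syscall):
    """Return the action a simplified profile would apply to a syscall."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # Whitelist behaviour: anything not listed gets the default action.
    return profile["defaultAction"]


# Hypothetical, heavily trimmed profile for illustration only.
sample_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "openat"], "action": "SCMP_ACT_ALLOW"},
    ],
}

print(action_for(sample_profile, "read"))    # SCMP_ACT_ALLOW
print(action_for(sample_profile, "reboot"))  # SCMP_ACT_ERRNO
```

Because `reboot` appears in no rule, it falls through to the default action, which is how the syscalls in the table end up blocked without being named in the profile.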
Disabling Seccomp
$ docker run --rm -it --security-opt seccomp=unconfined debian:jessie \
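To confirm whether a process is actually confined, the kernel reports the seccomp mode in the `Seccomp:` field of `/proc/<pid>/status` (0 = disabled, 1 = strict, 2 = filter). The sketch below parses that field; the helper name `seccomp_mode` is hypothetical, and it assumes a kernel recent enough to expose the field.

```python
def seccomp_mode(status_text):
    """Return the seccomp mode from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        # Match the exact "Seccomp:" key, not e.g. "Seccomp_filters:".
        if line.startswith("Seccomp:"):
            return int(line.split(":", 1)[1].strip())
    return None


MODES = {0: "disabled", 1: "strict", 2: "filter"}

if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            mode = seccomp_mode(f.read())
    except OSError:  # e.g. not running on Linux
        mode = None
    print("seccomp mode:", MODES.get(mode, "unknown"))
```

Run inside a container started with the default profile, this would typically report `filter`; with `seccomp=unconfined` it would typically report `disabled`.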