0%

docker命令分析--Seccomp特性


作者: 耗子007


Secure computing mode (Seccomp)是Linux内核的特性。可以使用Seccomp来限制容器内的行为。
该特性的有效基于:

  • Docker编译时加上了seccomp
  • 内核打开了CONFIG_SECCOMP配置

修改默认Seccomp配置文件

默认的Seccomp配置文件禁止了44个系统调用。
可以参考默认的配置文件,自定义然后在docker run的时候用–security-opt设置自定义的配置文件,例如:

1
$ docker run --rm -it --security-opt seccomp=/path/to/seccomp/profile.json hello-world

默认的Seccomp配置文件是一个白名单,没有指定的则是被禁止的。下表给出一些被禁止的系统调用(不是全部),以及原因。

Syscall Description
acct Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT.
add_key Prevent containers from using the kernel keyring, which is not namespaced.
adjtimex Similar to clock_settime and settimeofday, time/date is not namespaced. Also gated by CAP_SYS_TIME.
bpf Deny loading potentially persistent bpf programs into kernel, already gated by CAP_SYS_ADMIN.
clock_ adjtime Time/date is not namespaced. Also gated by CAP_SYS_TIME.
clock_ settime Time/date is not namespaced. Also gated by CAP_SYS_TIME.
clone Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_USERNS.
create_ module Deny manipulation and functions on kernel modules. Obsolete. Also gated by CAP_SYS_MODULE.
delete_ module Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
finit_ module Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
get_kernel_syms Deny retrieval of exported kernel and module symbols. Obsolete.
get_ mempolicy Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
init_ module Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
ioperm Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
iopl Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
kcmp Restrict process inspection capabilities, already blocked by dropping CAP_PTRACE.
kexec_file_load Sister syscall of kexec_load that does the same thing, slightly different arguments. Also gated by CAP_SYS_BOOT.
kexec_ load Deny loading a new kernel for later execution. Also gated by CAP_SYS_BOOT.
keyctl Prevent containers from using the kernel keyring, which is not namespaced.
lookup_ dcookie Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by CAP_SYS_ADMIN.
mbind Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
mount Deny mounting, already gated by CAP_SYS_ADMIN.
move_pages Syscall that modifies kernel memory and NUMA settings.
name_to_handle_at Sister syscall to open_by_handle_at. Already gated by CAP_SYS_NICE.
nfsservctl Deny interaction with the kernel nfs daemon. Obsolete since Linux 3.1.
open_by_handle_at Cause of an old container breakout. Also gated by CAP_DAC_READ_SEARCH.
perf_event_open Tracing/profiling syscall, which could leak a lot of information on the host.
personality Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns.
pivot_ root Deny pivot_root, should be privileged operation.
process_vm_readv Restrict process inspection capabilities, already blocked by dropping CAP_PTRACE.
process_vm_writev Restrict process inspection capabilities, already blocked by dropping CAP_PTRACE.
ptrace Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping CAP_PTRACE.
query_module Deny manipulation and functions on kernel modules. Obsolete.
quotactl Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_ADMIN.
reboot Don’t let containers reboot the host. Also gated by CAP_SYS_BOOT.
request_key Prevent containers from using the kernel keyring, which is not namespaced.
set_ mempolicy Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
setns Deny associating a thread with a namespace. Also gated by CAP_SYS_ADMIN.
settimeofday Time/date is not namespaced. Also gated by CAP_SYS_TIME.
stime Time/date is not namespaced. Also gated by CAP_SYS_TIME.
swapon Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
swapoff Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
sysfs Obsolete syscall.
_sysctl Obsolete, replaced by /proc/sys.
umount Should be a privileged operation. Also gated by CAP_SYS_ADMIN.
umount2 Should be a privileged operation. Also gated by CAP_SYS_ADMIN.
unshare Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN, with the exception of unshare –user.
uselib Older syscall related to shared libraries, unused for a long time.
userfaultfd Userspace page fault handling, largely needed for process migration.
ustat Obsolete syscall.
vm86 In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.
vm86old In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.

禁用Seccomp

1
2
$ docker run --rm -it --security-opt seccomp=unconfined debian:jessie \
unshare --map-root-user --user sh -c whoami