LXCFS Investigation and Testing

Posted by Mathew on 2020-02-16

Summary

1-2 sentence summary of what this document is about and why we should work on it.

  Why do we need to isolate the container's view of its resources?

  Containers can limit resource usage via cgroups, covering memory, CPU, and so on. Note, however, that when a process inside a container runs common monitoring commands such as free or top, it still sees the physical host's data rather than the container's, because containers do not isolate filesystems such as /proc and /sys.

In short: with lxcfs, top, free, uptime, and df inside a container report the container's own quota and actual resource usage.
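
A quick way to reproduce the problem without lxcfs (a hedged sketch; the busybox image is only illustrative and cgroup v1 paths are assumed):

]# docker run --rm -m 256m busybox free -m     # "total" reflects the host's memory, not the 256 MiB limit
]# docker run --rm -m 256m busybox cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # prints 268435456, the 256 MiB cgroup limit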

Background

What is the motivation for these changes? What problems will this solve? Include graphs, metrics, etc. if relevant.

In which scenarios is container resource view isolation needed?

1. From the business / user perspective:

  Some teams are used to running top, free, and similar commands on physical machines and VMs to check resource usage. Without resource view isolation, the data seen inside a container is still the host's data.

2. From the application runtime perspective:

  The runtime environment inside a container differs from that on a physical machine or VM, and some applications run into real problems inside containers:
  (1). Many JVM-based Java programs size their heap and stack at startup based on the system's resource limits (most of our services currently set heap/stack sizes explicitly in the JVM startup flags according to their project quota, but some do not). When a Java application runs inside a container, the JVM still sees the host's memory; if the container's quota is smaller than what the JVM tries to allocate at startup, the application fails to start. Some Java libraries also size their allocations based on this resource view, which carries the same risk.

  (2). CPU has the same problem. Most applications, such as nginx and other middleware, set their default number of worker threads from the cpuinfo they see. A process inside a container reads the CPU count from /proc/cpuinfo, but /proc inside the container is still the host's, which can hurt the performance of services running in the container. (See the short illustration after this list.)
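
A hedged illustration of both points (the busybox image is only an example; without lxcfs the values come straight from the host):

# MemTotal is the host's memory rather than the 256 MiB quota, and the processor count is the host's core count
]# docker run --rm -m 256m --cpus 1 busybox sh -c 'grep MemTotal /proc/meminfo; grep -c processor /proc/cpuinfo'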

Goals

What are the outcomes that will result from these changes? How will we evaluate success for the proposed changes?

  Mount lxcfs into containers at startup so that resource limits are correctly reported inside the container.

Non-Goals

To narrow the scope of what we’re working on, outline what this proposal will not accomplish.

  This work only covers resource view isolation and its runtime stability; anything beyond that is out of scope for now.

Proposed Solution

Describe the solution to the problems outlined above. Include enough detail to allow for productive discussion from readers, including diagrams if necessary.

(I). Docker test

(1). Install dependencies

]# yum install lxcfs-3.1.2-0.2.el7.x86_64 -y
]# rpm -ql lxcfs-3.1.2-0.2.el7.x86_64
/usr/bin/lxcfs
/usr/lib/systemd/system/lxcfs.service
/usr/lib64/lxcfs
/usr/lib64/lxcfs/liblxcfs.so
/usr/share/doc/lxcfs-3.1.2
/usr/share/doc/lxcfs-3.1.2/AUTHORS
/usr/share/licenses/lxcfs-3.1.2
/usr/share/licenses/lxcfs-3.1.2/COPYING
/usr/share/lxc/config/common.conf.d/00-lxcfs.conf
/usr/share/lxcfs
/usr/share/lxcfs/lxc.mount.hook
/usr/share/lxcfs/lxc.reboot.hook
/usr/share/man/man1/lxcfs.1.gz
/var/lib/lxcfs

]# cat /usr/lib/systemd/system/lxcfs.service
[Unit]
Description=FUSE filesystem for LXC
ConditionVirtualization=!container
Before=lxc.service
Documentation=man:lxcfs(1)

[Service]
ExecStart=/usr/bin/lxcfs /var/lib/lxcfs/
KillMode=process
Restart=on-failure
ExecStopPost=-/bin/fusermount -u /var/lib/lxcfs
Delegate=yes
ExecReload=/bin/kill -USR1 $MAINPID

[Install]
WantedBy=multi-user.target
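
The package does not enable the unit by default (note "disabled" in the status output below); to have lxcfs start on boot:

]# systemctl enable lxcfs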

(2). Start lxcfs

]# systemctl start lxcfs && systemctl status lxcfs
● lxcfs.service - FUSE filesystem for LXC
Loaded: loaded (/usr/lib/systemd/system/lxcfs.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2020-02-11 14:34:19 CST; 79ms ago
Docs: man:lxcfs(1)
Main PID: 27703 (lxcfs)
Tasks: 3
Memory: 736.0K
CGroup: /system.slice/lxcfs.service
└─27703 /usr/bin/lxcfs /var/lib/lxcfs/

Feb 11 14:34:19 bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com lxcfs[27703]: 2: fd: 8: memory
Feb 11 14:34:19 bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com lxcfs[27703]: 3: fd: 9: rdma
Feb 11 14:34:19 bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com lxcfs[27703]: 4: fd: 10: devices
Feb 11 14:34:19 bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com lxcfs[27703]: 5: fd: 11: cpu,cpuacct
Feb 11 14:34:19 bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com lxcfs[27703]: 6: fd: 12: freezer
Feb 11 14:34:19 bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com lxcfs[27703]: 7: fd: 13: blkio
Feb 11 14:34:19 bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com lxcfs[27703]: 8: fd: 14: pids
Feb 11 14:34:19 bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com lxcfs[27703]: 9: fd: 15: net_cls,net_prio
Feb 11 14:34:19 bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com lxcfs[27703]: 10: fd: 16: cpuset
Feb 11 14:34:19 bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com lxcfs[27703]: 11: fd: 17: name=systemd
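
Before wiring containers up, it is worth confirming on the host that the FUSE files are actually being served (a quick hedged check using the mount path from the unit file):

]# mount -t fuse.lxcfs                    # should list lxcfs mounted on /var/lib/lxcfs
]# head -2 /var/lib/lxcfs/proc/meminfo    # the FUSE-backed file should be readable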

(3). Start a container with lxcfs mounts and test

]# docker run -it  -m 256m -d --name=mathew-lxcfs --cpus 2 --cpuset-cpus "0,1" \
-v /var/lib/lxcfs/:/var/lib/lxcfs/:shared \
-v /var/lib/lxcfs/proc/cpuinfo:/proc/cpuinfo:slave \
-v /var/lib/lxcfs/proc/diskstats:/proc/diskstats:slave \
-v /var/lib/lxcfs/proc/meminfo:/proc/meminfo:slave \
-v /var/lib/lxcfs/proc/stat:/proc/stat:slave \
-v /var/lib/lxcfs/proc/swaps:/proc/swaps:slave \
-v /var/lib/lxcfs/proc/uptime:/proc/uptime:slave \
-v /var/lib/lxcfs/sys/devices/system/cpu/online:/sys/devices/system/cpu/online:slave \
docker.momo.com/k8s:python3.6.4 /bin/bash

[root@5738153c5666 deploy]# free -h
total used free shared buffers cached
Mem: 256M 1.6M 254M 0B 0B 0B
-/+ buffers/cache: 1.6M 254M
Swap: 512M 0B 512M
[root@5738153c5666 deploy]# grep processor /proc/cpuinfo
processor : 0
processor : 1

Check with top:
top - 06:53:41 up 0 min, 0 users, load average: 0.51, 0.65, 0.75
Tasks: 3 total, 1 running, 2 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.4%us, 0.0%sy, 0.0%ni, 99.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 262144k total, 3532k used, 258612k free, 0k buffers
Swap: 524288k total, 0k used, 524288k free, 0k cached

Check with lscpu:
[root@fe4e2c9ed171 deploy]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0,1 // CPUs 0 and 1 are online
Off-line CPU(s) list: 2-31
Thread(s) per core: 0
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 1200.440
BogoMIPS: 4205.51
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31

For the slave bind-propagation mode used in the docker run command above, see docker-bind-propagation and lxcfs-slave (links in the References section).
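
To double-check the propagation relationship, the mount tags in /proc/self/mountinfo can be inspected (a hedged check; a shared:<n> tag is expected on the host mount and a master:<n> tag, i.e. slave, on the container side):

]# grep /var/lib/lxcfs /proc/self/mountinfo                    # on the host
]# docker exec mathew-lxcfs grep /proc/cpuinfo /proc/self/mountinfo   # inside the test container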

(II). Kubernetes YAML lxcfs mount test

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-lxcfs
  labels:
    app: lxcfs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: lxcfs
  template:
    metadata:
      labels:
        app: lxcfs
    spec:
      containers:
      - name: pod-lscfs
        image: docker.momo.com/momo:centos7
        command: ['sh', '-c', 'echo Hello Kubernetes! && sleep 360000']
        resources:
          limits:
            cpu: "500m"
            memory: 1024Mi
          requests:
            cpu: "500m"
            memory: 1024Mi
        volumeMounts:
        - mountPath: /proc/cpuinfo
          name: cpuinfo
          mountPropagation: HostToContainer
        - mountPath: /proc/diskstats
          name: diskstats
          mountPropagation: HostToContainer
        - mountPath: /proc/meminfo
          name: meminfo
          mountPropagation: HostToContainer
        - mountPath: /proc/stat
          name: stat
          mountPropagation: HostToContainer
        - mountPath: /proc/swaps
          name: swaps
          mountPropagation: HostToContainer
          readOnly: true
        - mountPath: /proc/uptime
          name: uptime
          mountPropagation: HostToContainer
        - mountPath: /var/lib/lxc/
          name: lxc
          mountPropagation: HostToContainer
      nodeSelector:
        kubernetes.io/hostname: bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com
      volumes:
      - hostPath:
          path: /var/lib/lxc/lxcfs/proc/cpuinfo
          type: ""
        name: cpuinfo
      - hostPath:
          path: /var/lib/lxc/lxcfs/proc/diskstats
          type: ""
        name: diskstats
      - hostPath:
          path: /var/lib/lxc/lxcfs/proc/meminfo
          type: ""
        name: meminfo
      - hostPath:
          path: /var/lib/lxc/lxcfs/proc/stat
          type: ""
        name: stat
      - hostPath:
          path: /var/lib/lxc/lxcfs/proc/swaps
          type: ""
        name: swaps
      - hostPath:
          path: /var/lib/lxc/lxcfs/proc/uptime
          type: ""
        name: uptime
      - hostPath:
          path: /var/lib/lxc/
          type: ""
        name: lxc

Note for Kubernetes: the hostPath mounts must match the slave mode used with docker, so each volumeMount needs mountPropagation: HostToContainer; see the HostToContainer and kernel shared-subtree references.

]# kubectl create -f deploy-pods.yaml
]# kubectl get pod -o wide | grep pod-lxcfs
pod-lxcfs-75d77d895f-bpbq7 1/1 Running 0 6m37s 10.232.141.137 bjdx-platform-kube-node-colocation-a029.prod.bjdx.momo.com <none> <none>
]# kubectl exec -it pod-lxcfs-75d77d895f-bpbq7 bash
## Memory view isolation works as expected
[root@pod-lxcfs-75d77d895f-bpbq7 /]# free -h
total used free shared buff/cache available
Mem: 1.0G 9.8M 1.0G 0B 0B 1.0G
Swap: 1.0G 0B 1.0G
## CPU: note the YAML above sets a CPU quota of 500m, i.e. 0.5 core, which is displayed as 1 core here
top - 07:00:03 up 8 min, 0 users, load average: 0.39, 0.53, 0.67
Tasks: 5 total, 1 running, 4 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1048576 total, 1038376 free, 10200 used, 0 buff/cache
KiB Swap: 1048576 total, 1048576 free, 0 used. 1038376 avail Mem

[root@pod-lxcfs-75d77d895f-bpbq7 /]# grep processor /proc/cpuinfo
processor : 0

Risks

Highlight any risks here so your reviewers can direct their attention here.
  When the lxcfs service is restarted or crashes, the mount points previously bound into containers' /proc become invalid, and free and top stop working inside those containers. To handle this, lxcfs is run on every node as a systemd service; after the service restarts and comes up successfully, an ExecStartPost hook runs /usr/local/bin/container_remount_lxcfs.sh to remount the lxcfs files into every container that had them mounted before.


# Simulate an lxcfs crash and observe the effect
~]# systemctl stop lxcfs
~]# kubectl exec -it pod-lxcfs-75d77d895f-bpbq7 bash
[root@pod-lxcfs-75d77d895f-bpbq7 /]# free -h
Error: /proc must be mounted
To mount /proc at boot you need an /etc/fstab line like:
proc /proc proc defaults
In the meantime, run "mount proc /proc -t proc"
[root@pod-lxcfs-75d77d895f-bpbq7 /]# top

top: failed /proc/stat open: Transport endpoint is not connected

After restarting lxcfs, the lxcfs files must be remounted; otherwise free and top keep failing in containers that previously had lxcfs mounted.
Hot-remount lxcfs into a running container:

]# PID=$(docker inspect --format '{{.State.Pid}}' mathew-lxcfs)
]# for file in meminfo cpuinfo loadavg stat diskstats swaps uptime;do nsenter --target $PID --mount -- mount -B "/var/lib/lxcfs/proc/$file" "/proc/$file"; done

Enter the container and run free, top, and lscpu to verify that the remount took effect:

[root@bb140d86a374 deploy]# free -h
total used free shared buffers cached
Mem: 256M 2.8M 253M 0B 0B 0B
-/+ buffers/cache: 2.8M 253M
Swap: 512M 0B 512M
[root@bb140d86a374 deploy]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0,1 // online CPU cores
Off-line CPU(s) list: 2-31
Thread(s) per core: 0
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 1201.130
BogoMIPS: 4205.51
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
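The remount script referenced in the Risks section, /usr/local/bin/container_remount_lxcfs.sh (note that it assumes lxcfs is mounted under /var/lib/lxc/lxcfs, matching the Kubernetes test above):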
#!/bin/bash

PATH=$PATH:/bin
LXCFS="/var/lib/lxc/lxcfs"
LXCFS_ROOT_PATH="/var/lib/lxc"

containers=$(docker ps | egrep -v 'pause|calico|cadvisor' | awk '{print $1}' | grep -v CONTAINE)
for container in $containers; do
    mountpoint=$(docker inspect --format '{{ range .Mounts }}{{ if eq .Destination "/var/lib/lxc" }}{{ .Source }}{{ end }}{{ end }}' $container)
    if [ "$mountpoint" = "$LXCFS_ROOT_PATH" ]; then
        echo "remount $container"
        PID=$(docker inspect --format '{{.State.Pid}}' $container)
        # mount /proc
        for file in meminfo cpuinfo loadavg stat diskstats swaps uptime; do
            echo nsenter --target $PID --mount -- mount -B "$LXCFS/proc/$file" "/proc/$file"
            nsenter --target $PID --mount -- mount -B "$LXCFS/proc/$file" "/proc/$file"
        done
        # mount /sys
        for file in online; do
            echo nsenter --target $PID --mount -- mount -B "$LXCFS/sys/devices/system/cpu/$file" "/sys/devices/system/cpu/$file"
            nsenter --target $PID --mount -- mount -B "$LXCFS/sys/devices/system/cpu/$file" "/sys/devices/system/cpu/$file"
        done
    fi
done

Add the following to the lxcfs.service file:

# add remount script
ExecStartPost=/usr/local/bin/container_remount_lxcfs.sh

Restart lxcfs and verify that resource view isolation still works inside the containers:

]# systemctl daemon-reload
]# systemctl restart lxcfs
]# docker exec -it mathew-lxcfs bash
[root@bb140d86a374 deploy]# free -h
total used free shared buffers cached
Mem: 256M 2.8M 253M 0B 0B 0B
-/+ buffers/cache: 2.8M 253M
Swap: 512M 0B 512M
]# kubectl exec -it pod-lxcfs-67d765f58b-7tjtt bash
[root@pod-lxcfs-67d765f58b-7tjtt /]# free -h
total used free shared buff/cache available
Mem: 1.0G 8.6M 1.0G 0B 0B 1.0G
Swap: 1.0G 0B 1.0G

Milestones

Break down the solution into key tasks and their estimated deadlines.

1) Produce this investigation/testing wiki. 2020-02-13
2) Run it on the YARN offline-task test nodes in the colocation cluster. 2020-02-14
3) Keep observing how lxcfs behaves (including whether remounting stays stable across container restarts and host reboots). By the end of this quarter

Open Questions

Ask any unresolved questions about the proposed solution here.

1) Integrating with Kubernetes has two dependencies:
  (1) hostPath mounts in the YAML must be adapted with mountPropagation: HostToContainer.
  (2) To use (1), MountFlags=shared must be added to the systemd docker.service unit, otherwise restarting lxcfs breaks resource view isolation for Kubernetes pods (a sketch of this change follows below).
2) On the same node, if one pod uses mountPropagation: HostToContainer as in 1)(1) and another pod does not, restarting lxcfs on that node breaks resource view isolation; the root cause is still under investigation.
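
A minimal sketch of the docker.service change from 1)(2), assuming a systemd drop-in is used rather than editing the unit file in place (the drop-in path is illustrative):

# /etc/systemd/system/docker.service.d/mount-flags.conf  (illustrative drop-in path)
[Service]
MountFlags=shared

]# systemctl daemon-reload && systemctl restart docker    # note: restarting docker also restarts running containers unless live-restore is enabled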

Follow Up Tasks

What needs to be done next for this proposal?

  • [1] Look into more stable community approaches for integrating lxcfs with Kubernetes.

References

https://github.com/lxc/lxcfs
https://linuxcontainers.org/lxcfs/introduction/
https://github.com/libfuse/libfuse
https://kubernetes.io/docs/concepts/storage/volumes/#configuration
https://success.docker.com/article/not-a-shared-mount-error
https://github.com/denverdino/lxcfs-admission-webhook