
An inelegant fix for a broken docker daemon layer on 18.09.3

source link: https://zhangguanzhang.github.io/2022/02/10/docker-daemon-layer-broken/
 2022/02/10   Share

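This post records a case where layers stored by an 18.09.3 docker daemon were corrupted and could not be repaired in place. The fix is not elegant, but I could not find a better one, so I am writing it down for reference.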

After the machine rebooted, some pods failed to start.

$ docker info
Containers: 51
Running: 27
Paused: 0
Stopped: 24
Images: 23
Server Version: 18.09.3
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e
runc version: 6635b4f0c6af3810594d2770f662f34ddc15b40d
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-693.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.51GiB
Name: hdzwvm000006238.novalocal
ID: AUFF:32CM:54KK:FA2F:M3GS:EI77:2VSQ:HH3T:2LXM:7AFG:WXAQ:IKSV
Docker Root Dir: /data/kube/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:

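An initial check confirmed that some images were corrupted. Take the one below: docker history --no-trunc showed that its rootfs is ubuntu, yet running it fails with the following error: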

$ docker run --rm -ti --entrypoint bash xxx.cn/base/xxxxxx-amd64:v2
standard_init_linux.go:207: exec user process caused "no such file or directory"

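We had seen similar cases before, and docker rmi followed by docker load fixed them. This time a manual load after rmi did not help, even though the md5sum of the offline image file matched the one recorded in the release package.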

$ md5sum ./images/xxxxxx-amd64-v2#release_zzzzzzz 
cd1cf11ac90d6df59a31460cb1624933 ./images/xxxxxx-amd64-v2#release_zzzzzzz
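
The checksum comparison above can be scripted so that it fails loudly on a mismatch. This is only a minimal sketch: verify_md5 is a hypothetical helper name, and the expected digest is assumed to come from whatever manifest ships with the release package.

```shell
# verify_md5 FILE EXPECTED_MD5
# Returns 0 when FILE's MD5 digest equals EXPECTED_MD5, 1 otherwise.
# (verify_md5 is a hypothetical helper, not part of docker or coreutils.)
verify_md5() {
    local file=$1 expected=$2 actual
    # md5sum prints "<digest>  <path>"; keep only the digest field.
    actual=$(md5sum -- "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got '$actual', expected '$expected')" >&2
        return 1
    fi
}
```

For the file in this post the call would look like verify_md5 './images/xxxxxx-amd64-v2#release_zzzzzzz' cd1cf11ac90d6df59a31460cb1624933. A matching digest only proves the archive is intact; it says nothing about what the daemon already has on disk.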

$ docker rmi xxx.cn/base/xxxxxx-amd64:v2
Untagged: xxx.cn/base/xxxxxx-amd64:v2
Deleted: sha256:fe7c32d1138c5215dba9fbfa4f675eff47f1a30605d9914fff34a5db00ad45f0
$ docker load -i xxxxxx-amd64-v2#release_zzzzzzz
Loaded image: xxx.cn/base/xxxxxx-amd64:v2
$ docker run --rm -ti --entrypoint bash xxx.cn/base/xxxxxx-amd64:v2
standard_init_linux.go:207: exec user process caused "no such file or directory"

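Further digging showed that the sangfor security software (its EDR agent) was installed, and that the machine had been rebooted.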

$ ps aux | grep san
root 1183 0.0 0.0 113184 1492 ? S Feb09 0:03 /bin/bash /sangfor/edr/agent/bin/eps_services_ctrl
root 5132 0.0 0.0 113436 1696 ? S Feb09 0:17 /bin/bash /sangfor/edr/agent/bin/abs_monitor
root 5164 0.0 0.0 48092 3392 ? S Feb09 0:04 /sangfor/edr/agent/bin/abs_deployer
root 5205 0.0 0.0 43036 1552 ? Ss Feb09 0:07 /sangfor/edr/agent/bin/edr_monitor
root 5378 0.0 0.0 194948 6260 ? Sl Feb09 0:04 /sangfor/edr/agent/bin/sfupdatemgr -p edr_monitor
root 5379 0.0 0.0 43360 3560 ? S Feb09 0:01 /sangfor/edr/agent/bin/ipc_proxy
root 5380 0.6 0.1 708028 29892 ? Sl Feb09 6:56 /sangfor/edr/agent/bin/edr_agent
root 5381 0.1 0.0 17060 1332 ? S< Feb09 1:58 /sangfor/edr/agent/bin/cpulimit --limit=50 --exe=edr_agent
root 5382 0.0 0.0 113568 1900 ? S Feb09 0:28 /bin/bash /sangfor/edr/agent/bin/asset_collection_cpulimit.sh
root 5383 0.0 0.0 128944 5444 ? Sl Feb09 0:27 /sangfor/edr/agent/bin/edr_sec_plan
root 5384 0.0 0.0 117656 8956 ? S Feb09 0:00 /sangfor/edr/agent/bin/lloader /sangfor/edr/agent/bin/../lmodules/isolate_area_tool.lua
root 5385 0.0 0.0 68916 3928 ? S Feb09 0:01 /sangfor/edr/agent/bin/lloader /sangfor/edr/agent/bin/../lmodules/isolate_area_main.lua
root 22594 0.0 0.0 112712 976 pts/2 S+ 11:37 0:00 grep --color=auto san
$ uptime -s
2022-02-09 17:19:09
You have new mail in /var/spool/mail/root
$ tail -n 40 /var/spool/mail/root
...
edr pid 5205
ls: cannot access /sangfor/edr/agent/bin/../packages/: No such file or directory

$ ll /etc/cron.d
total 12
-rw-r--r--. 1 root root 128 Aug 3 2017 0hourly
-rw-r--r-- 1 root root 60 Dec 10 2020 edr_agent
-rw-------. 1 root root 235 Apr 1 2020 sysstat
You have new mail in /var/spool/mail/root
$ cat edr_agent
* * * * * root /sangfor/edr/agent/bin/eps_services_check.sh

The problem persisted after the customer uninstalled it. Running docker save then exposed the real problem:

$ docker save -o test.tar xxx.cn/base/xxxxxx-amd64:v2 
Error response from daemon: open /data/kube/docker/overlay2/920a06a6d4eb64db0898234cd3a81b01115d6fcc2cfc50c5107e0205f7230318/diff/lib/x86_64-linux-gnu/ld-2.23.so: no such file or directory
$ docker inspect xxx.cn/base/xxxxxx-amd64:v2 | grep 920a0
"LowerDir": ...:/data/kube/docker/overlay2/920a06a6d4eb64db0898234cd3a81b01115d6fcc2cfc50c5107e0205f7230318/diff",
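
Instead of grepping for a prefix of the layer ID, the failing path can be matched against the image's layer list. The sketch below assumes the colon-separated LowerDir string has already been obtained, for example with docker inspect --format '{{.GraphDriver.Data.LowerDir}}' on an overlay2 daemon; split_lowerdir and layer_for_path are hypothetical helper names.

```shell
# split_lowerdir LOWERDIR
# Prints each layer directory of a colon-separated overlay2 LowerDir
# string on its own line. (Hypothetical helper.)
split_lowerdir() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# layer_for_path LOWERDIR PATH
# Prints the layer directory that contains PATH; returns 1 if none does.
layer_for_path() {
    local dir
    while IFS= read -r dir; do
        [ -n "$dir" ] || continue          # skip empty entries
        case "$2" in
            "$dir"/*) echo "$dir"; return 0 ;;
        esac
    done <<EOF
$(split_lowerdir "$1")
EOF
    return 1
}
```

Passing the path from the docker save error as PATH tells you which of the image's lower layers is the damaged one. The image's topmost layer is reported under UpperDir rather than LowerDir, so a miss here does not rule that layer out.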

$ ls -l /data/kube/docker/overlay2/920a06a6d4eb64db0898234cd3a81b01115d6fcc2cfc50c5107e0205f7230318/diff/lib/x86_64-linux-gnu/ | head
total 10684
lrwxrwxrwx 1 root root 10 Feb 6 2019 ld-linux-x86-64.so.2 -> ld-2.23.so
lrwxrwxrwx 1 root root 15 Feb 7 2016 libacl.so.1 -> libacl.so.1.1.0
-rw-r--r-- 1 root root 31232 Feb 7 2016 libacl.so.1.1.0
-rw-r--r-- 1 root root 14992 Feb 6 2019 libanl-2.23.so
lrwxrwxrwx 1 root root 14 Feb 6 2019 libanl.so.1 -> libanl-2.23.so
lrwxrwxrwx 1 root root 20 May 29 2019 libapparmor.so.1 -> libapparmor.so.1.4.0
-rw-r--r-- 1 root root 64144 May 29 2019 libapparmor.so.1.4.0
lrwxrwxrwx 1 root root 16 Sep 9 2014 libattr.so.1 -> libattr.so.1.1.0
-rw-r--r-- 1 root root 18624 Sep 9 2014 libattr.so.1.1.0
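
The listing shows the symlink ld-linux-x86-64.so.2 still present while its target ld-2.23.so is gone. Since that file is the dynamic loader, every dynamically linked binary in the image fails to exec with "no such file or directory". A minimal sketch for spotting such dangling links in a layer directory (find_dangling_symlinks is a hypothetical helper name):

```shell
# find_dangling_symlinks DIR
# Prints every symlink under DIR whose target does not exist.
# (Hypothetical helper; uses only POSIX find and test.)
find_dangling_symlinks() {
    # -type l selects symlinks; "test -e" follows the link and fails
    # when the target is missing, so only broken links are printed.
    find "$1" -type l ! -exec test -e {} \; -print
}
```

One caveat: absolute symlink targets are resolved against the host's root, not the layer's, so only relative links such as the one above give a trustworthy answer when scanning a diff directory from the host.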

Loading the same offline image file on another machine showed that the corresponding layer does contain ld-2.23.so:

$ ll b5f1b3d6665a476b9460532568499f2923c1621d710f6a1e20cf7f3e1a928e17/diff/lib/x86_64-linux-gnu/
total 10844
-rwxr-xr-x 1 root root 162632 Feb 6 2019 ld-2.23.so
lrwxrwxrwx 1 root root 10 Feb 6 2019 ld-linux-x86-64.so.2 -> ld-2.23.so

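Finally I tried it locally and found that when a layer in the daemon's storage is corrupted, rmi followed by load does not overwrite it. Loading a genuinely new image prints progress for every layer, like this: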

$ docker load -i netshoot#latest 
b2d5eeeaba3a: Loading layer [==================================================>] 5.88MB/5.88MB
681ff9ab4914: Loading layer [==================================================>] 301.4MB/301.4MB
0e91662a9cb3: Loading layer [==================================================>] 8.683MB/8.683MB
fdcdfe126cc0: Loading layer [==================================================>] 13.63MB/13.63MB
270c883ade5e: Loading layer [==================================================>] 45.31MB/45.31MB
06e19b7687c5: Loading layer [==================================================>] 14.54MB/14.54MB
def3433d213c: Loading layer [==================================================>] 4.566MB/4.566MB
5b6adb9801a8: Loading layer [==================================================>] 869.9kB/869.9kB
765e2d110fbc: Loading layer [==================================================>] 1.831MB/1.831MB
eead121d6964: Loading layer [==================================================>] 7.168kB/7.168kB
400127227d7a: Loading layer [==================================================>] 3.072kB/3.072kB
2b4f749a4a39: Loading layer [==================================================>] 6.571MB/6.571MB
Loaded image: netshoot:latest
$ docker load -i netshoot#latest
Loaded image: netshoot:latest

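When the image already exists, only a single Loaded image line is printed. Looking back, our load after rmi printed no layer lines either. I skimmed the code but have not yet worked out how it decides that a layer already exists, so as a stopgap I deleted the overlay2 directory.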
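
One place to look when checking whether the daemon still has a layer registered is its layer metadata. The sketch below rests on an assumption about the on-disk layout that is not confirmed in this post: that the daemon keeps one directory per layer under <docker root>/image/overlay2/layerdb/sha256/<chain-id>/, each holding a cache-id file that names the overlay2 directory. chain_id_for_cache_id is a hypothetical helper name.

```shell
# chain_id_for_cache_id LAYERDB_DIR CACHE_ID
# Prints the chain ID whose cache-id file contains CACHE_ID; returns 1
# if no layer references it.
# ASSUMPTION: layout is LAYERDB_DIR/sha256/<chain-id>/cache-id.
chain_id_for_cache_id() {
    local f
    for f in "$1"/sha256/*/cache-id; do
        [ -f "$f" ] || continue            # glob matched nothing
        if [ "$(cat "$f")" = "$2" ]; then
            basename "$(dirname "$f")"
            return 0
        fi
    done
    return 1
}
```

If the layout assumption holds, running this against /data/kube/docker/image/overlay2/layerdb with the 920a06a6... directory name after the rmi would show whether the damaged layer was still registered, which would be consistent with load skipping it.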

