
Fixing Docker's "read unix @->/run/containerd/s/xxx read: connection reset by peer..." error

source link: https://zhangguanzhang.github.io/2021/09/16/read-containerd-con-reset/

 2021/09/16  103  Share

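To test how shutting down machines affects the cluster, I powered off several machines. Afterwards, many pods were stuck in CrashLoopBackOff or RunContainerError, or never became ready.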

[root@CentOS76 ~]# docker info
Client:
 Debug Mode: false

Server:
 Containers: 404
  Running: 258
  Paused: 0
  Stopped: 146
 Images: 110
 Server Version: 19.03.14
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ea765aba0d05254012b0b9e595e995c09186427f
 runc version: v1.0.0-0-g84113eef6fc2
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1160.36.2.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.76GiB
 Name: CentOS76
 ID: BJ2X:EX7H:SCME:Q3AD:IP2M:IB2D:E4RL:XA4C:EOMQ:7S3F:DIA6:WQ2C
 Docker Root Dir: /data/kube/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  reg.xxx.lan:5000
  treg.yun.xxx.cn
  127.0.0.0/8
 Registry Mirrors:
  https://registry.docker-cn.com/
  https://docker.mirrors.ustc.edu.cn/
 Live Restore Enabled: false
 Product License: Community Engine

The logs showed the following:

RunContainerError: failed to start container "90353b19ae6c7209ba1785286c292f2362fa069b578f2e2731e93747c5ba1912": Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/90353b19ae6c7209ba1785286c292f2362fa069b578f2e2731e93747c5ba1912/log.json: no such file or directory): runc did not terminate sucessfully: unknown

There were also log entries like these:

runc did not terminate sucessfully: runtime/cgo: pthread_create failed: Resource temporarily unavailable

container 9853a196008b92033a299e098d73d4268a76ce58faecfe40ca3411857d44a776: unknown error after kill: fork/exec /data/kube/bin/runc: resource temporarily unavailable: : unknown"

This looked like a resource limit being hit. I checked, and the default kernel.pid_max was too small:

$ sysctl -n kernel.pid_max
32768
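
For a rough sense of how close a node is to that ceiling, the total number of tasks can be compared with kernel.pid_max. The following is a minimal sketch, assuming a Linux /proc; the function names total_tasks and pid_headroom are made up for illustration. It relies on the fourth field of /proc/loadavg, which has the form running/total, where total counts every process and thread that currently exists.

```shell
# total_tasks [FILE]: print the total number of processes and threads,
# taken from the "running/total" field of a loadavg-format file.
# FILE defaults to /proc/loadavg.
total_tasks() {
    awk '{ split($4, parts, "/"); print parts[2] }' "${1:-/proc/loadavg}"
}

# pid_headroom TOTAL PID_MAX: print how many more tasks fit below PID_MAX.
pid_headroom() {
    echo $(( $2 - $1 ))
}
```

On a live node this could be called as pid_headroom "$(total_tasks)" "$(sysctl -n kernel.pid_max)". The result is only approximate, since the task count changes constantly.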

I then adjusted the following parameters one after another:

cat > /etc/security/limits.d/21-custom.conf<<EOF
* soft nproc 131072
* hard nproc 131072
* soft nofile 131072
* hard nofile 131072
root soft nproc 131072
root hard nproc 131072
root soft nofile 131072
root hard nofile 131072
EOF

sed -ri 's/^#(DefaultLimitCORE)=/\1=100000/' /etc/systemd/system.conf
sed -ri 's/^#(DefaultLimitNOFILE)=/\1=100000/' /etc/systemd/system.conf
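
One thing worth keeping in mind with these settings: files under /etc/security/limits.d are applied by pam_limits to login sessions, while services started by systemd take their limits from the unit file or from system.conf. A way to see what a running process actually received is to read /proc/<pid>/limits. The helper below is a sketch under that assumption; the name show_limit is hypothetical.

```shell
# show_limit NAME [FILE]: print the soft and hard values of one limit
# from a file in /proc/<pid>/limits format.
# FILE defaults to /proc/self/limits (the limits of the reading process).
show_limit() {
    awk -v name="$1" '
        index($0, name " ") == 1 {
            # Drop the limit name, then split the rest on whitespace.
            rest = substr($0, length(name) + 1)
            split(rest, field, " ")
            print field[1], field[2]
        }' "${2:-/proc/self/limits}"
}
```

For example, show_limit 'Max processes' "/proc/$(pgrep dockerd)/limits" would print the soft and hard nproc values that dockerd is running with.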

After a reboot the pods still had not recovered. Starting a container that was stuck in the Created state produced the following error:

[root@CentOS76 ~]# docker start 034f
Error response from daemon: read unix @->/run/containerd/s/2ac09cf054eb19b79336b25efe1aeeaf22bcf0d9559ca79b8459c3490cd6034f: read: connection reset by peer: unknown
Error: failed to start containers: 034f

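Running a container by hand failed with the error below; after the parameter changes, the error above became the more common one.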

$ docker run --rm nginx:1.19-alpine
docker: Error response from daemon: failed to start shim: fork/exec /usr/bin/containerd-shim: resource temporarily unavailable: unknown.

Following the call flow, the read unix @->/run/containerd/s error is a containerd problem. The source code shows that if containerd is not already running, dockerd starts one itself through os/exec:

$ ps aux | grep '\scontainerd\s'
root 147580 2.4 0.1 10375568 104588 ? Ssl 17:06 3:15 containerd --config /var/run/docker/containerd/containerd.toml --log-level warn
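
To confirm that such a containerd really was launched by dockerd, its parent PID can be read from /proc. This is a small sketch, assuming the standard Linux /proc/<pid>/status layout; the function name parent_pid is made up for illustration.

```shell
# parent_pid PID [PROC_ROOT]: print the parent PID recorded in
# PROC_ROOT/PID/status. PROC_ROOT defaults to /proc.
parent_pid() {
    awk '$1 == "PPid:" { print $2 }' "${2:-/proc}/$1/status"
}
```

If the value printed for the containerd PID matches the output of pgrep dockerd, containerd is a child of dockerd.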

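Our Docker was installed from the official static binaries. With an rpm install the two are separated: there is a containerd rpm with its own containerd.service. I wanted to look at containerd's log output in our environment, but according to the source the command's output is bound to dockerd's output. The command-line arguments are also fixed, so the log level cannot be switched to debug.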

So I tried killing it and starting it by hand:

kill -9 147580 && containerd --config /var/run/docker/containerd/containerd.toml --log-level debug

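In another SSH session I saw that all pods had returned to normal. That meant Docker started by systemd was running under some limit. I looked through dockerd's /proc directory and similar places, but it had not reached any limit such as open files: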

[root@CentOS76 ~]# pgrep dockerd
113233
[root@CentOS76 ~]# lsof -p 113233 | wc -l
956
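
When a process behaves differently depending on how it was started, it can help to check which cgroup it belongs to, because systemd applies per-unit limits through the unit's cgroup. A process started from an SSH session normally lands in that session's scope rather than in the service's cgroup. This is an inference about the setup here, not something the steps above verify. The sketch below assumes the usual /proc/<pid>/cgroup format (cgroup v1 with a pids controller, or the unified v2 line); the name cgroup_of is hypothetical.

```shell
# cgroup_of PID [PROC_ROOT]: print the cgroup path of a process, using the
# pids controller line (cgroup v1) or the unified "0::" line (cgroup v2).
# PROC_ROOT defaults to /proc.
cgroup_of() {
    awk -F: '$2 ~ /(^|,)pids(,|$)/ || ($1 == "0" && $2 == "") { print $3; exit }' \
        "${2:-/proc}/$1/cgroup"
}
```

Comparing cgroup_of for the dockerd PID and for a hand-started containerd PID shows whether both fall under the same unit.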

Finally I found the cause: the Tasks: 2043 (limit: 2048) limit in the output below:

[root@CentOS76 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since 四 2021-09-16 16:53:21 CST; 4min 16s ago
Docs: http://docs.docker.io
Process: 113228 ExecStopPost=/bin/sh -c /sbin/iptables --wait -D INPUT -i cni0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
Process: 113225 ExecStopPost=/bin/sh -c /sbin/iptables --wait -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
Process: 113236 ExecStartPost=/sbin/iptables --wait -I INPUT -i cni0 -j ACCEPT (code=exited, status=0/SUCCESS)
Process: 113234 ExecStartPost=/sbin/iptables --wait -I FORWARD -s 0.0.0.0/0 -j ACCEPT (code=exited, status=0/SUCCESS)
Process: 113231 ExecStartPre=/bin/bash -c test -d /var/run/docker.sock && rmdir /var/run/docker.sock || true (code=exited, status=0/SUCCESS)
Main PID: 113233 (dockerd)
Tasks: 2043 (limit: 2048)
Memory: 1.1G
CGroup: /system.slice/docker.service
├─ 89710 containerd-shim -namespace
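
The Tasks figure counts threads as well as processes in the unit's cgroup, so every containerd-shim and every thread it spawns uses up part of the budget. As a sketch of how the remaining headroom could be pulled out of this output automatically, the helper below parses the Tasks line from text in the systemctl status format; the name tasks_headroom is made up for illustration.

```shell
# tasks_headroom: read `systemctl status` text on stdin and print
# "current limit remaining" for the Tasks line, if one with a limit exists.
tasks_headroom() {
    sed -n 's/^[[:space:]]*Tasks: \([0-9][0-9]*\) (limit: \([0-9][0-9]*\)).*/\1 \2/p' |
    while read -r cur max; do
        echo "$cur $max $((max - cur))"
    done
}
```

It can be fed with systemctl status docker | tasks_headroom. With the numbers shown above, only 5 tasks were left before the limit.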

systemd's DefaultTasksMax is 2048 on this system. I also compared against the official docker.service, which does not limit Tasks; ours was missing that setting:

$ systemctl cat docker
..
ExecReload=/bin/kill -s HUP $MAINPID
Restart=on-failure
RestartSec=5
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
Delegate=yes
KillMode=process

After adding TasksMax=infinity to the [Service] section and restarting Docker, everything was fine:

$ vi /etc/systemd/system/docker.service
TasksMax=infinity


systemctl daemon-reload && systemctl restart docker
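
Editing the unit file directly works. A drop-in file is another common way to apply the same setting without touching the original unit. The following is a sketch, assuming systemd's standard drop-in directory layout; the file name tasksmax.conf and the function name write_tasksmax_dropin are arbitrary choices for illustration.

```shell
# write_tasksmax_dropin [UNIT_DIR]: create a drop-in that lifts the task limit.
# UNIT_DIR defaults to the drop-in directory for docker.service.
write_tasksmax_dropin() {
    dir="${1:-/etc/systemd/system/docker.service.d}"
    mkdir -p "$dir" || return 1
    # TasksMax is a resource-control setting, so it goes under [Service].
    cat > "$dir/tasksmax.conf" <<'EOF'
[Service]
TasksMax=infinity
EOF
}
```

After calling write_tasksmax_dropin as root, the same systemctl daemon-reload && systemctl restart docker applies it, and systemctl show docker -p TasksMax can be used to check the effective value.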
