
Container network problems caused by a low-version Docker not cleaning up veth interfaces

source link: https://zhangguanzhang.github.io/2022/09/07/docker-veth-not-clean/#/%E5%8F%82%E8%80%83%EF%BC%9A




2022/09/07

I was helping a colleague debug a gitlab-runner problem; it turned out to be a bug in an old version of Docker that fails to clean up veth interfaces.

The colleague's gitlab-ci.yml looked roughly like this:

include:
  - project: "xxx/ci-template"
    file: "/backend/common_mini.yml"

test:
  services:
    - name: minio/minio
      command: ["server", "/data"]
      alias: minio
    - name: mysql:5.7.17
  variables:
    FILTER_COVER_PACKAGES: "grep -E 'impl'"
    MYSQL_DATABASE: "docmini"
    MINIO_UPODATE: "off"

The builds run with the Docker executor. He reported that the build container could not reach minio on port 9000. Taking a quick look, I found that a service's alias is actually implemented with docker run's --link, i.e. a record pointing to the service container's IP is added to the build container's hosts file; the official documentation for the services field describes it the same way.
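
As a quick illustration of that mechanism, unrelated to the runner itself (the alias name myalias and the images are my own choices), --link simply writes the alias into the target container's /etc/hosts:

$ docker run -d --name web nginx:alpine
$ docker run --rm --link web:myalias alpine cat /etc/hosts
# the output contains a line of the form:
# 172.17.0.2    myalias <container-id> web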

The symptom during the build was that 7a9ff9c8ee95 could not reach port 9000 on minio; even using the IP directly instead of the alias failed:

$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7a9ff9c8ee95 83daaac121e6 "sh -c 'if [ -x /usr…" 41 seconds ago Up 40 seconds runner-s4f3kt7-project-15927-concurrent-0-eed4eb59d5dfa521-build-2
883faff17a2d e59a4655709b "/usr/bin/dumb-init …" 42 seconds ago Exited (0) 41 seconds ago runner-s4f3kt7-project-15927-concurrent-0-eed4eb59d5dfa521-predefined-1
2662fd58cad5 e59a4655709b "/usr/bin/dumb-init …" 43 seconds ago Exited (0) 42 seconds ago runner-s4f3kt7-project-15927-concurrent-0-eed4eb59d5dfa521-predefined-0
139e1a01b50f 9546ca122d3a "docker-entrypoint.s…" About a minute ago Up About a minute 3306/tcp runner-s4f3kt7-project-15927-concurrent-0-eed4eb59d5dfa521-mysql-1
de4647deead4 c15374551d3a "/usr/bin/docker-ent…" About a minute ago Up About a minute 9000/tcp runner-s4f3kt7-project-15927-concurrent-0-eed4eb59d5dfa521-minio__minio-0
$ docker inspect de46 | grep -i pid
"Pid": 23888,
"PidMode": "",
"PidsLimit": 0,
$ nsenter --net -t 23888 curl 172.25.0.2:9000
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied.</Message><Resource>/</Resource><RequestId>17127033A4D00006</RequestId><HostId>ec1fb8ef-f0f0-488e-a71e-da444933f2ed</HostId></Error>
$ docker exec -ti 7a9ff9c8ee95 curl 172.25.0.2:9000
curl: (7) Failed to connect to 172.25.0.2 port 9000: Connection refused

The iptables rules and the forwarding sysctls both checked out. What stood out was that the host itself could not reach the port either; it was only reachable from inside the minio container's own network namespace:

$ curl 172.25.0.2:9000
curl: (7) Failed to connect to 172.25.0.2 port 9000: Connection refused
$ nsenter --net -t 23888 curl 172.25.0.2:9000
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied.</Message><Resource>/</Resource><RequestId>17127033A4D00006</RequestId><HostId>ec1fb8ef-f0f0-488e-a71e-da444933f2ed</HostId></Error>
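
For reference, the sanity checks mentioned above were along these lines (a sketch; the exact chain contents depend on the setup):

$ sysctl net.ipv4.ip_forward                      # should print 1
$ iptables -S FORWARD                             # FORWARD policy and the DOCKER chains
$ iptables -t nat -S POSTROUTING | grep 172.25    # MASQUERADE rule for the docker0 subnet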

To rule out an issue with the minio service itself, I cleaned up the containers above and tested with the official nginx image:

$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
$ docker run -d --name t1 --rm -p 81:80 nginx:alpine
91fa481376cbbbdf04dd7ed027048ad20f40eee18f4e7d916d9edba8da102412
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
91fa481376cb nginx:alpine "/docker-entrypoint.…" About a minute ago Up About a minute 0.0.0.0:81->80/tcp t1
$ docker inspect t1 | grep IPAddress
"IPAddress": "172.25.0.2",

Access was still broken; even turning on promiscuous mode on docker0 did not help:

$ curl 172.25.0.2
curl: (7) Failed to connect to 172.25.0.2 port 80: Connection refused
$ ip link set docker0 promisc on
$ curl 172.25.0.2
curl: (7) Failed to connect to 172.25.0.2 port 80: Connection refused

$ curl localhost:81
curl: (56) Recv failure: Connection reset by peer
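
The localhost:81 failure points the same way: a published port is just a DNAT rule to the container IP, so if 172.25.0.2 itself is unreachable, the port mapping fails too. Roughly (the rule shown is the form Docker generates, reproduced here for illustration):

$ iptables -t nat -S DOCKER | grep 81
# expect something like:
# -A DOCKER ! -i docker0 -p tcp -m tcp --dport 81 -j DNAT --to-destination 172.25.0.2:80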

After cleaning up the container, I noticed the veth devices did not look right:

$ ip a s 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:87:52:b5 brd ff:ff:ff:ff:ff:ff
inet 10.226.48.239/23 brd 10.226.49.255 scope global ens160
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fe87:52b5/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:c0:7b:8b:bf brd ff:ff:ff:ff:ff:ff
inet 172.25.0.1/16 brd 172.25.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:c0ff:fe7b:8bbf/64 scope link
valid_lft forever preferred_lft forever
9005: veth4a669ee@if9004: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether aa:ad:04:ea:b9:a1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::a8ad:4ff:feea:b9a1/64 scope link
valid_lft forever preferred_lft forever
9007: vethe0ddac0@if9006: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether 02:9c:d5:3d:18:8f brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::9c:d5ff:fe3d:188f/64 scope link
valid_lft forever preferred_lft forever

With not a single container left, the veths above should not exist. Installing bridge-utils to check confirmed they were leftovers still attached to docker0:

$ apt-get install -y bridge-utils 
$ brctl show
bridge name     bridge id               STP enabled     interfaces
docker0         8000.0242c07b8bbf       no              veth4a669ee
                                                        vethe0ddac0
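
Before fixing the root cause, the stale devices can also be listed and removed by hand (a temporary workaround; the device names are taken from the output above):

$ ip -o link show type veth        # list remaining veth devices
$ ip link delete veth4a669ee
$ ip link delete vethe0ddac0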

Docker allocates container IPs sequentially from the start of the subnet, so the leftover theory can be verified by starting several more containers: the first ones will reuse the addresses tied to the stale veths and stay unreachable, while later ones get fresh addresses; if those are reachable, the uncleaned veths are confirmed as the cause:

$ docker run -d  --rm -p 81:80 nginx:alpine
a0952e42f6a0da9d1969b327e696022c2dea041061cee2fbf080134037c9c93b
$ docker run -d --rm -p 82:80 nginx:alpine
dcf7d4635f1379b321603760c71a94bf70b1b954bb5528d384f7f9d38d4ed005
$ docker run -d --rm -p 83:80 nginx:alpine
44d5611125bb773ecc24baf760d4ba19f90a33c0b5c802e6afbeee462f200df0
$ docker run -d --rm -p 84:80 nginx:alpine
145b99fea102f71bc99132f2ae5aa8401f890df5d10177964df3e3f4e3fd8281
$ curl localhost:82
curl: (56) Recv failure: Connection reset by peer
$ curl localhost:83
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>
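
To tie the result back to the addresses, the IP of each new container can be checked; under the sequential allocation described above, the first two should have landed on 172.25.0.2 and 172.25.0.3, the addresses whose veths were left behind (container IDs from the run above, expected values in comments):

$ docker inspect -f '{{.NetworkSettings.IPAddress}}' a0952e42f6a0   # expect 172.25.0.2 (stale veth, port 81 fails)
$ docker inspect -f '{{.NetworkSettings.IPAddress}}' dcf7d4635f13   # expect 172.25.0.3 (stale veth, port 82 fails)
$ docker inspect -f '{{.NetworkSettings.IPAddress}}' 44d5611125bb   # expect 172.25.0.4 (fresh, port 83 works)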

So the leftovers really were the cause. A look at docker info showed the version was quite old:

Server Version: 18.03.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: N/A (expected: 4fc53a81fb7c994640722ac585fa9ca548971871)
init version: N/A (expected: )
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-184-generic
Operating System: Ubuntu 16.04.6 LTS
OSType: linux
Architecture: x86_64

Searching for docker bridge network veth not clean up shows many people have hit this; it is a bug in older Docker versions. After uninstalling Docker and reinstalling it with the official script, everything worked normally:

curl -fsSL "https://get.docker.com/" | bash -s -- --mirror Aliyun
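
Roughly, the full sequence on this Ubuntu 16.04 host would look like the following (a sketch; the package name to purge depends on how Docker was originally installed):

$ systemctl stop docker
$ apt-get purge -y docker-ce                      # or docker-engine / docker.io, whichever is installed
$ curl -fsSL "https://get.docker.com/" | bash -s -- --mirror Aliyun
$ docker info | grep 'Server Version'             # confirm the upgraded version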
