Troubleshooting a Cluster DNS Resolution Failure Caused by cni-plugins

2021/08/24
Source: https://zhangguanzhang.github.io/2021/08/24/cni-plugins-bridge-err/

The environment is Kubernetes 1.15.5 on x86_64. The original command output got flushed out of my terminal while I was digging through logs, so what follows is a rough reconstruction.
Kafka could not connect to zookeeper, and its logs showed the hostname failing to resolve. Checking coredns, all replicas were down:

$ kubectl -n kube-system get po -o wide -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-5757945748-l2d2g 0/1 CrashLoopBackOff 254 3d11h 172.27.0.2 10.25.1.55 <none> <none>
coredns-5757945748-w5pfx 0/1 CrashLoopBackOff 254 3d11h 172.27.0.5 10.25.1.55 <none> <none>
coredns-5757945748-wfndd 0/1 CrashLoopBackOff 254 3d11h 172.27.0.3 10.25.1.55 <none> <none>
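As an aside, a quick way to confirm in-cluster DNS is broken beyond just kafka (the busybox image here is an assumption; any image with nslookup works):

$ kubectl run dns-test --rm -it --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default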

The coredns logs showed errors reaching the kubernetes service at https://172.26.0.1:443/xxxx, failing with No route to host.

On node 10.25.1.55 I ran docker ps -a | grep coredns to find the pause container ID, then docker inspect xxxxx | grep -m1 -i pid to get the process PID, and used nsenter to enter its network namespace.
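For reference, those steps boil down to something like this sketch (the k8s_POD_ filter assumes Docker's standard pause-container naming):

# grab the first matching pause container ID, then read its PID
$ CID=$(docker ps -q --filter name=k8s_POD_coredns | head -1)
$ docker inspect --format '{{.State.Pid}}' "$CID"
14659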

$ nsenter --net --target 14659 
$ ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
3: eth0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 42:7d:b0:83:a9:aa brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.27.0.5/16 scope global eth0
valid_lft forever preferred_lft forever

curl'ing the service IP https://172.26.0.1 from inside the namespace also returned No route to host. I then hit the endpoint directly, i.e. the kube-apiserver's real IP and port, and that was unreachable too:

curl -kvL https://10.25.1.51:6443

Pinging the host from inside the namespace failed as well. IP forwarding was enabled, and there was no security software on the node.
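Concretely, "forwarding was enabled" refers to sysctls like these (a sketch, not an exhaustive list; the bridge-nf-call knob requires the br_netfilter module to be loaded):

$ sysctl net.ipv4.ip_forward net.bridge.bridge-nf-call-iptables
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1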

Then I looked at the bridging, and that is where the problem was. Some context first on what a healthy setup looks like: we run flannel (deployed as a DaemonSet, not as a binary), and cni-plugins attaches every container to the cni0 bridge.
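The CNI config driving this is the flannel conflist. A minimal sketch of what it typically looks like (the path and exact values are assumptions for illustration; flannel's delegate invokes the bridge plugin, whose default bridge name is cni0):

$ cat /etc/cni/net.d/10-flannel.conflist
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true}
    }
  ]
}

For comparison, here is the bridging info of a coredns pod in a healthy environment: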

# get the container PID
$ docker inspect d30 | grep -m1 -i pid
"Pid": 9079,

$ nsenter --net --target 9079 ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
3: eth0@if210: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 06:3e:42:00:91:33 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.27.0.74/24 brd 172.27.0.255 scope global eth0
valid_lft forever preferred_lft forever

Note the number after if: it is the ifindex of the veth peer on the host side. Check which veth that is on the host:

$ ip link | grep -E '^210'
210: vetha1ca1d55@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
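That lookup can be automated with a small sketch: parse the @ifN suffix from inside the pod's netns, then match the index on the host.

$ IDX=$(nsenter --net --target 9079 ip -o link show eth0 | sed -n 's/.*@if\([0-9]*\).*/\1/p')
$ ip -o link | grep "^${IDX}: "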

And brctl confirms this veth is attached to cni0:

brctl show cni0 | grep vetha1ca1d55
vetha1ca1d55

Now the bridging info on the broken machine:

$ nsenter --net --target 14659 ip a s 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
3: eth0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 42:7d:b0:83:a9:aa brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.27.0.5/16 scope global eth0
valid_lft forever preferred_lft forever

$ ip -o link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT qlen 1\ link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast state UP mode DEFAULT qlen 1000\ link/ether fa:16:3e:35:09:13 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000\ link/ether fa:16:3e:e2:4d:8d brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT \ link/ether 02:42:3c:a2:d7:3b brd ff:ff:ff:ff:ff:ff
6: vethda399ca1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT \ link/ether a2:95:74:65:29:6d brd ff:ff:ff:ff:ff:ff link-netnsid 0
7: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT \ link/ether 0a:e5:a5:9a:66:8b brd ff:ff:ff:ff:ff:ff
8: veth96d8f326@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT \ link/ether 92:a5:9c:7c:dd:cb brd ff:ff:ff:ff:ff:ff link-netnsid 1
9: vethdf8eb371@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT \ link/ether 4a:fb:37:c3:f2:7e brd ff:ff:ff:ff:ff:ff link-netnsid 2
10: veth44ce32f4@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT \ link/ether 6e:96:73:2b:4e:52 brd ff:ff:ff:ff:ff:ff link-netnsid 3
11: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT qlen 1000\ link/ether 2a:f9:74:b7:11:b8 brd ff:ff:ff:ff:ff:ff
12: veth0f91b3a3@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT \ link/ether 0a:36:90:dc:0a:97 brd ff:ff:ff:ff:ff:ff link-netnsid 4

Checking cni0, veth44ce32f4 was not attached to it at all. Note in the ip -o link output above that veth44ce32f4 (the broken pod's peer, host ifindex 10) has no master cni0 and an MTU of 1500, unlike the healthy veth0f91b3a3 with master cni0 and MTU 1450:

$ ip a s cni0
11: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP qlen 1000
link/ether 2a:f9:74:b7:11:b8 brd ff:ff:ff:ff:ff:ff
inet 172.27.1.1/24 scope global cni0
valid_lft forever preferred_lft forever
inet6 fe80::28f9:74ff:feb7:11b8/64 scope link
valid_lft forever preferred_lft forever
$ brctl show
bridge name bridge id STP enabled interfaces
cni0 8000.2af974b711b8 no veth0f91b3a3
docker0 8000.02423ca2d73b no
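So the veth pair was created, but the host end was never enslaved to cni0, nor given the bridge's MTU. In principle that could be patched by hand, something like this untested sketch:

$ ip link set veth44ce32f4 master cni0
$ ip link set veth44ce32f4 mtu 1450

Even then, the pod's address (172.27.0.5/16) would still not match cni0's subnet (172.27.1.0/24), so recreating the pods is the cleaner fix.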

The bridging had clearly gone wrong, so I deleted the pods and let them be recreated, and everything recovered:

$ kubectl -n kube-system delete pod -l k8s-app=kube-dns
pod "coredns-5757945748-l2d2g" deleted
pod "coredns-5757945748-wfndd" deleted
pod "coredns-5757945748-wjrdf" deleted
$ kubectl -n kube-system get po -o wide -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-5757945748-4smll 1/1 Running 0 29s 172.27.1.3 10.25.1.55 <none> <none>
coredns-5757945748-d9wqk 1/1 Running 0 29s 172.27.3.6 10.25.1.54 <none> <none>
coredns-5757945748-dtvfl 1/1 Running 0 29s 172.27.2.6 10.25.1.56 <none> <none>
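A quick sanity check on the node to confirm the new pod's veth is now enslaved to the bridge:

$ ip -o link | grep 'master cni0'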

I took the checksums of the cni-plugins binaries on site and looked them up: we were running v0.7.0. High time to upgrade.
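For reference, that identification is just hashing the plugin binaries (the /opt/cni/bin path is the conventional default, an assumption here) and comparing against the checksums published with each cni-plugins release:

$ sha256sum /opt/cni/bin/bridge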

