Troubleshooting a node with broken networking in a flannel cluster

source link: https://zhangguanzhang.github.io/2021/08/25/flannel-a-host-net-tmout/




2021/08/25

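The problem had nothing to do with versions; the customer's node details show up in the troubleshooting output below. One node had communication problems, while all the other nodes were fine.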

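First, look at flannel's VXLAN VTEP information. The customer's machines have two NICs, but the default route goes through this NIC, so the other one can be ignored. In the output below, VtepMAC and public-ip both look normal.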

$ kubectl get node -o yaml | grep -B4 public
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"76:21:69:41:de:fe"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.51
--
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"b6:61:5c:8d:d9:eb"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.52
--
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"1e:8c:3e:12:fc:0f"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.53
--
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"ba:fe:64:36:6e:a1"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.54
--
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"8e:c1:4d:18:e5:d6"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.55
--
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"fe:95:e6:bf:a0:62"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.56
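
The check above was done by eye. As a rough sketch, the same grep output could be parsed and checked mechanically. What counts as "normal" here is an assumption on my part: every node has a public-ip and no two nodes share a VtepMAC. The helper names `parse_vteps` and `duplicate_macs` are made up for illustration, and the sample only includes two of the six nodes.

```python
import json
import re

# Excerpt of the grep output above (two of the six nodes).
SAMPLE = """\
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"76:21:69:41:de:fe"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.51
--
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"8e:c1:4d:18:e5:d6"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.55
"""


def parse_vteps(text):
    """Return {public_ip: vtep_mac} from the grep'd flannel annotations."""
    vteps = {}
    # grep -B separates match groups with a line containing only "--".
    for block in text.split("\n--\n"):
        mac = re.search(r"backend-data: '(\{.*\})'", block)
        ip = re.search(r"public-ip: (\S+)", block)
        if mac and ip:
            vteps[ip.group(1)] = json.loads(mac.group(1))["VtepMAC"]
    return vteps


def duplicate_macs(vteps):
    """Return the set of VTEP MACs claimed by more than one node."""
    seen, dupes = set(), set()
    for mac in vteps.values():
        if mac in seen:
            dupes.add(mac)
        seen.add(mac)
    return dupes


vteps = parse_vteps(SAMPLE)
print(vteps)
print("duplicate VtepMACs:", duplicate_macs(vteps))
```

An empty duplicate set only rules out one class of problem; it says nothing about whether the underlay network actually delivers the packets, which is what the rest of this post digs into.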

The coredns pod IPs and the nodes they run on:

$ kubectl -n kube-system get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-5757945748-cr67w 1/1 Running 0 19h 172.27.2.7 10.25.1.56 <none> <none>
coredns-5757945748-krwfd 1/1 Running 0 19h 172.27.1.4 10.25.1.55 <none> <none>
coredns-5757945748-zf4zm 1/1 Running 0 19h 172.27.3.7 10.25.1.54 <none> <none>

Try curling the coredns metrics port. Only 10.25.1.51 cannot communicate with the other nodes, so the following curl hangs when run on it:

$ curl 172.27.1.4:9153

On the target machine 10.25.1.55, capture our curl packets on the flannel.1 interface:

$ tcpdump -nn -i flannel.1 host 172.27.1.4 and port 9153 -vv
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:07:46.203094 IP (tos 0x0, ttl 64, id 56025, offset 0, flags [DF], proto TCP (6), length 60)
172.27.0.0.57888 > 172.27.1.4.9153: Flags [S], cksum 0x6804 (correct), seq 879302783, win 28200, options [mss 1410,sackOK,TS val 56279718 ecr 0,nop,wscale 7], length 0
10:07:46.203173 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.27.1.4.9153 > 172.27.0.0.57888: Flags [S.], cksum 0x5969 (incorrect -> 0x163b), seq 4197245653, ack 879302784, win 27960, options [mss 1410,sackOK,TS val 431774697 ecr 56279718,nop,wscale 7], length 0
10:07:47.204797 IP (tos 0x0, ttl 64, id 56026, offset 0, flags [DF], proto TCP (6), length 60)
172.27.0.0.57888 > 172.27.1.4.9153: Flags [S], cksum 0x641a (correct), seq 879302783, win 28200, options [mss 1410,sackOK,TS val 56280720 ecr 0,nop,wscale 7], length 0
10:07:47.204880 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.27.1.4.9153 > 172.27.0.0.57888: Flags [S.], cksum 0x5969 (incorrect -> 0x1251), seq 4197245653, ack 879302784, win 27960, options [mss 1410,sackOK,TS val 431775699 ecr 56279718,nop,wscale 7], length 0

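The target does appear to reply (172.27.1.4.9153 > 172.27.0.0.57888). On 10.25.1.51, the machine running curl, lsof -nPi :57888 confirms that the port belongs to the pid of the hung curl command. Capture on 10.25.1.51 at the same time: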

$ tcpdump -nn -i flannel.1 host 172.27.1.4 and port 9153 -vv
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:08:57.241129 IP (tos 0x0, ttl 64, id 34444, offset 0, flags [DF], proto TCP (6), length 60)
172.27.0.0.57966 > 172.27.1.4.9153: Flags [S], cksum 0x5969 (incorrect -> 0x2fb2), seq 276913734, win 28200, options [mss 1410,sackOK,TS val 56350922 ecr 0,nop,wscale 7], length 0
10:08:58.242423 IP (tos 0x0, ttl 64, id 34445, offset 0, flags [DF], proto TCP (6), length 60)
172.27.0.0.57966 > 172.27.1.4.9153: Flags [S], cksum 0x5969 (incorrect -> 0x2bc8), seq 276913734, win 28200, options [mss 1410,sackOK,TS val 56351924 ecr 0,nop,wscale 7], length 0
10:09:00.246423 IP (tos 0x0, ttl 64, id 34446, offset 0, flags [DF], proto TCP (6), length 60)
172.27.0.0.57966 > 172.27.1.4.9153: Flags [S], cksum 0x5969 (incorrect -> 0x23f4), seq 276913734, win 28200, options [mss 1410,sackOK,TS val 56353928 ecr 0,nop,wscale 7], length 0
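
To make the comparison between the two flannel.1 captures explicit, here is a small sketch that counts SYN and SYN-ACK packets in tcpdump text output. The capture lines are copied from above with the TCP options trimmed for brevity, and `handshake_summary` is a made-up helper name, not part of any tool.

```python
import re

# Lines from the capture on 10.25.1.55 (TCP options trimmed).
CAPTURE_ON_55 = """\
172.27.0.0.57888 > 172.27.1.4.9153: Flags [S], cksum 0x6804 (correct), seq 879302783, win 28200, length 0
172.27.1.4.9153 > 172.27.0.0.57888: Flags [S.], cksum 0x5969 (incorrect -> 0x163b), seq 4197245653, ack 879302784, win 27960, length 0
172.27.0.0.57888 > 172.27.1.4.9153: Flags [S], cksum 0x641a (correct), seq 879302783, win 28200, length 0
172.27.1.4.9153 > 172.27.0.0.57888: Flags [S.], cksum 0x5969 (incorrect -> 0x1251), seq 4197245653, ack 879302784, win 27960, length 0
"""

# Lines from the capture on 10.25.1.51 (TCP options trimmed).
CAPTURE_ON_51 = """\
172.27.0.0.57966 > 172.27.1.4.9153: Flags [S], cksum 0x5969 (incorrect -> 0x2fb2), seq 276913734, win 28200, length 0
172.27.0.0.57966 > 172.27.1.4.9153: Flags [S], cksum 0x5969 (incorrect -> 0x2bc8), seq 276913734, win 28200, length 0
172.27.0.0.57966 > 172.27.1.4.9153: Flags [S], cksum 0x5969 (incorrect -> 0x23f4), seq 276913734, win 28200, length 0
"""

FLAGS = re.compile(r"^(\S+) > (\S+): Flags \[([^\]]+)\]")


def handshake_summary(capture):
    """Return (syn_count, synack_count) for the tcpdump lines in capture."""
    syn = synack = 0
    for line in capture.splitlines():
        match = FLAGS.match(line.strip())
        if not match:
            continue  # timestamp/IP header lines carry no TCP flags
        flags = match.group(3)
        if flags == "S":
            syn += 1
        elif flags == "S.":
            synack += 1
    return syn, synack


print("on 10.25.1.55:", handshake_summary(CAPTURE_ON_55))
print("on 10.25.1.51:", handshake_summary(CAPTURE_ON_51))
```

SYN-ACKs show up on the target side but not on the client side, and the client keeps retransmitting the same SYN (same seq), which is what a reply lost somewhere between the two hosts looks like.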

No reply arrives; only the outgoing SYNs show up. Next, capture on eth1 on flannel's port 8475 (we changed the flannel port in our configuration):

Capture on the target machine 10.25.1.55:

$ tcpdump -nn -i eth1 host 10.25.1.51 and port 8475 -vvv
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:09:40.966705 IP (tos 0x0, ttl 64, id 50110, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.42770 > 10.25.1.55.8475: [no cksum] UDP, length 82
10:09:40.966869 IP (tos 0x0, ttl 64, id 46192, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
10:09:41.968322 IP (tos 0x0, ttl 64, id 50327, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.42770 > 10.25.1.55.8475: [no cksum] UDP, length 82
10:09:41.968440 IP (tos 0x0, ttl 64, id 46957, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
10:09:43.099646 IP (tos 0x0, ttl 64, id 47316, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
10:09:43.972322 IP (tos 0x0, ttl 64, id 51119, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.42770 > 10.25.1.55.8475: [no cksum] UDP, length 82
10:09:43.972454 IP (tos 0x0, ttl 64, id 47934, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
^C
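
As a side check that these UDP packets really are the encapsulated SYN and SYN-ACK packets seen on flannel.1, the lengths can be added up. This is only arithmetic on the numbers printed in the captures; the header sizes are the standard ones for Ethernet, VXLAN, UDP, IPv4 and TCP without options, which is my assumption rather than something stated in the post.

```python
# Sizes in bytes. Header sizes are standard values, not taken from the post.
INNER_IP_LEN = 60    # "length 60": the SYN as seen on flannel.1
INNER_ETH_HDR = 14   # Ethernet header of the encapsulated frame
VXLAN_HDR = 8
UDP_HDR = 8
IPV4_HDR = 20        # assuming no IP options
TCP_HDR = 20         # fixed TCP header, as used when deriving the MSS

# What the outer UDP datagram carries: VXLAN header + inner Ethernet frame.
udp_payload = INNER_IP_LEN + INNER_ETH_HDR + VXLAN_HDR
# Total length of the outer IP packet seen on eth1.
outer_ip_len = udp_payload + UDP_HDR + IPV4_HDR

FLANNEL_MTU = 1450   # from the ifconfig flannel.1 output in this post
mss = FLANNEL_MTU - IPV4_HDR - TCP_HDR

print(udp_payload)   # 82, matches "UDP, length 82"
print(outer_ip_len)  # 110, matches "proto UDP (17), length 110"
print(mss)           # 1410, matches "mss 1410" in the SYN options
```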

Capture on 10.25.1.51, the machine running curl:

$ tcpdump -nn -i eth1 host 10.25.1.55 and port 8475 -vvv
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:10:21.702308 IP (tos 0x0, ttl 64, id 6079, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.59558 > 10.25.1.55.8475: [no cksum] UDP, length 82
10:10:22.702441 IP (tos 0x0, ttl 64, id 6117, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.59558 > 10.25.1.55.8475: [no cksum] UDP, length 82
10:10:24.706444 IP (tos 0x0, ttl 64, id 7699, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.59558 > 10.25.1.55.8475: [no cksum] UDP, length 82
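
The key difference between the two eth1 captures is which directions show up. A minimal sketch that counts packets per direction from the tcpdump lines above (only the address lines are reproduced, and `directions` is a made-up helper name):

```python
import re
from collections import Counter

# Address lines from the eth1 capture on 10.25.1.55.
ON_55 = """\
10.25.1.51.42770 > 10.25.1.55.8475: [no cksum] UDP, length 82
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
10.25.1.51.42770 > 10.25.1.55.8475: [no cksum] UDP, length 82
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
10.25.1.51.42770 > 10.25.1.55.8475: [no cksum] UDP, length 82
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
"""

# Address lines from the eth1 capture on 10.25.1.51.
ON_51 = """\
10.25.1.51.59558 > 10.25.1.55.8475: [no cksum] UDP, length 82
10.25.1.51.59558 > 10.25.1.55.8475: [no cksum] UDP, length 82
10.25.1.51.59558 > 10.25.1.55.8475: [no cksum] UDP, length 82
"""

# tcpdump -nn prints endpoints as "a.b.c.d.port".
PKT = re.compile(r"^(\d+\.\d+\.\d+\.\d+)\.\d+ > (\d+\.\d+\.\d+\.\d+)\.\d+: ")


def directions(capture):
    """Count packets per (source IP, destination IP) pair."""
    counts = Counter()
    for line in capture.splitlines():
        match = PKT.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


seen_on_55 = directions(ON_55)
seen_on_51 = directions(ON_51)
print("on 10.25.1.55:", dict(seen_on_55))
print("on 10.25.1.51:", dict(seen_on_51))
```

On 10.25.1.55 both directions are present, so the replies do leave its eth1. On 10.25.1.51 only the outgoing direction is present, so the replies are dropped before they reach that host's NIC.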

No packets come back at all. The counters on the flannel interface show it has never received a single packet:

$ ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 172.27.0.0 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::7421:69ff:fe41:defe prefixlen 64 scopeid 0x20<link>
ether 76:21:69:41:de:fe txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 28900 bytes 2113052 (2.0 MiB)
TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0
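
If this check has to be repeated on several nodes, the counters can be pulled out of the ifconfig text. A small sketch, using the output above as sample input; `packet_counters` is a made-up helper name.

```python
import re

# The ifconfig flannel.1 output from above (inet6 line omitted).
IFCONFIG = """\
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 172.27.0.0 netmask 255.255.255.255 broadcast 0.0.0.0
ether 76:21:69:41:de:fe txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 28900 bytes 2113052 (2.0 MiB)
TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0
"""


def packet_counters(text):
    """Return {"RX": n, "TX": m} from the 'RX/TX packets' lines."""
    pairs = re.findall(r"^\s*(RX|TX) packets (\d+)", text, re.MULTILINE)
    return {direction: int(count) for direction, count in pairs}


counters = packet_counters(IFCONFIG)
print(counters)
if counters["TX"] > 0 and counters["RX"] == 0:
    print("interface sends but has never received anything")
```

On Linux the same numbers are also exposed under /sys/class/net/flannel.1/statistics/ (rx_packets, tx_packets), which avoids parsing ifconfig's text format.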

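This shows that packets sent from 10.25.1.55 never reach 10.25.1.51. After the customer opened UDP 8475 for east-west traffic across the whole 10.25.1.0/24 segment in the security group, everything worked: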

$ curl  172.27.1.4:9153
^C
$ curl 172.27.1.4:9153
404 page not found
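
As a quick sanity check that a rule scoped to 10.25.1.0/24 covers every node in this cluster, the public-ip values from the node annotations can be tested against the segment. This only checks address membership; it does not query any cloud security group API.

```python
import ipaddress

# Segment the customer opened UDP 8475 for.
SEGMENT = ipaddress.ip_network("10.25.1.0/24")

# public-ip values from the flannel node annotations at the top of the post.
NODES = [
    "10.25.1.51", "10.25.1.52", "10.25.1.53",
    "10.25.1.54", "10.25.1.55", "10.25.1.56",
]

# Any node outside the segment would still be blocked by the rule.
outside = [ip for ip in NODES if ipaddress.ip_address(ip) not in SEGMENT]
print("nodes not covered by the rule:", outside)
```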
