
Implementing Kubernetes Service External-IP with LVS DR Mode

Source: http://just4coding.com/2021/11/14/external-ip/

Published 2021-11-14, updated 2021-11-15. Category: Kubernetes

The previous article <<Kubernetes Service网络通信路径>> introduced the various kinds of Kubernetes Services. To expose a service to clients outside the Kubernetes cluster, you can choose NodePort or LoadBalancer. However, LoadBalancer is natively supported mainly on the major public cloud providers, and exposing a service via NodePort allocates a port from a high port range, so the service cannot be exposed on its original port number, such as MySQL's 3306.

The official Service documentation describes a complementary mechanism called External-IP, which exposes the service through that IP on the worker nodes and can be used with any type of Service. Users outside the cluster can then reach the service through this IP. However, if the IP exists on only a single worker node, there is no high availability, so we need to configure this VIP (Virtual IP) on multiple worker nodes. We can then use the DR (Direct Routing) mode of LVS (also known as IPVS) as an external load balancer to distribute traffic across multiple worker nodes while keeping the packets' destination address set to the VIP.

DR mode only rewrites a packet's destination MAC address to the MAC address of the backend RealServer. This requires the load balancer (Director) and the RealServers to be on the same layer-2 network, and response packets do not travel back through the Director.

Below we experiment with using LVS DR mode to load-balance the Service.

In the lab cluster from the previous article, we create a Service of type ClusterIP (the default) and specify an external IP:

apiVersion: v1
kind: Service
metadata:
  labels:
    name: whoami
  name: whoami
spec:
  ports:
  - port: 80
    name: web
    protocol: TCP
  selector:
    app: whoami
  externalIPs:
  - 10.240.0.201
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whoami
  labels:
    app: whoami
spec:
  replicas: 3
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami
    spec:
      containers:
      - name: whoami
        image: containous/whoami
        ports:
        - containerPort: 80
          name: web

Create the service:

kubectl apply -f whoami.yaml

List the services; the EXTERNAL-IP of whoami is 10.240.0.201:

[root@master1 ~]# kubectl get svc -o wide
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP    PORT(S)   AGE   SELECTOR
kubernetes   ClusterIP   10.32.0.1    <none>         443/TCP   24d   <none>
whoami       ClusterIP   10.32.0.60   10.240.0.201   80/TCP    30m   app=whoami

Inspect the iptables rules on a worker node; rules for the EXTERNAL-IP have been added to the KUBE-SERVICES chain:

-A KUBE-SERVICES ! -s 10.230.0.0/16 -d 10.32.0.60/32 -p tcp -m comment --comment "default/whoami:web cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.32.0.60/32 -p tcp -m comment --comment "default/whoami:web cluster IP" -m tcp --dport 80 -j KUBE-SVC-225DYIB7Z2N6SCOU
-A KUBE-SERVICES -d 10.240.0.201/32 -p tcp -m comment --comment "default/whoami:web external IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.240.0.201/32 -p tcp -m comment --comment "default/whoami:web external IP" -m tcp --dport 80 -m physdev ! --physdev-is-in -m addrtype ! --src-type LOCAL -j KUBE-SVC-225DYIB7Z2N6SCOU
-A KUBE-SERVICES -d 10.240.0.201/32 -p tcp -m comment --comment "default/whoami:web external IP" -m tcp --dport 80 -m addrtype --dst-type LOCAL -j KUBE-SVC-225DYIB7Z2N6SCOU
-A KUBE-SERVICES ! -s 10.230.0.0/16 -d 10.32.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.32.0.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS

When a packet's destination is 10.240.0.201:80, it jumps to the KUBE-SVC-* chain and is distributed to one of the pods. Because the rules specify -m addrtype --dst-type LOCAL, we need to add the VIP on the node:

[root@node1 ~]# ip addr add 10.240.0.201/32 dev lo
[root@node1 ~]# ip addr show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet 10.240.0.201/32 scope global lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever

Since this VIP has to exist on multiple worker nodes, we configure it on lo and suppress ARP responses for the VIP on the corresponding NIC:

sysctl -w net.ipv4.conf.eth1.arp_ignore=1
sysctl -w net.ipv4.conf.eth1.arp_announce=2
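
These settings, together with the VIP on lo, have to be applied on every worker node that acts as a RealServer. Below is a minimal sketch (not from the original article) for doing this in one pass; it assumes the two lab nodes node1 and node2, root SSH access, and eth1 as the NIC on the shared layer-2 network:

# Sketch: configure the VIP and ARP suppression on all RealServer worker nodes.
for node in node1 node2; do
    ssh root@"$node" '
        ip addr add 10.240.0.201/32 dev lo 2>/dev/null || true  # ignore "address already exists"
        sysctl -w net.ipv4.conf.eth1.arp_ignore=1                # reply to ARP only if the target IP is on the incoming NIC
        sysctl -w net.ipv4.conf.eth1.arp_announce=2              # use the best local source address in ARP announcements
    '
done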

Accessing the VIP from the node succeeds:

[root@node1 ~]# curl http://10.240.0.201
Hostname: whoami-5df4df6ff5-kbv68
IP: 127.0.0.1
IP: ::1
IP: 10.230.95.10
IP: fe80::d43a:9eff:fe3e:4425
RemoteAddr: 10.230.74.0:60086
GET / HTTP/1.1
Host: 10.240.0.201
User-Agent: curl/7.29.0
Accept: */*

[root@node1 ~]# curl http://10.240.0.201
Hostname: whoami-5df4df6ff5-n6jmj
IP: 127.0.0.1
IP: ::1
IP: 10.230.74.25
IP: fe80::9889:dff:fedf:f376
RemoteAddr: 10.230.74.1:60088
GET / HTTP/1.1
Host: 10.240.0.201
User-Agent: curl/7.29.0
Accept: */*

[root@node1 ~]# curl http://10.240.0.201
Hostname: whoami-5df4df6ff5-2h6qf
IP: 127.0.0.1
IP: ::1
IP: 10.230.74.24
IP: fe80::2493:9aff:fe7b:5dbd
RemoteAddr: 10.230.74.1:60090
GET / HTTP/1.1
Host: 10.240.0.201
User-Agent: curl/7.29.0
Accept: */*

Next, we start another virtual machine on the same layer-2 network as the worker nodes to act as the LVS Director. On this machine, add the VIP to the NIC that shares layer-2 connectivity with the worker nodes:

ip addr add 10.240.0.201/32 dev eth1

Use ipvsadm to create the load-balancing service and add the two worker nodes as backend RealServers in DR mode:

ipvsadm -A -t 10.240.0.201:80 -s rr
ipvsadm -a -t 10.240.0.201:80 -r 10.240.0.101 -g
ipvsadm -a -t 10.240.0.201:80 -r 10.240.0.102 -g

Check the load-balancing service:

[root@lb1 ~]# ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.240.0.201:80 rr
  -> 10.240.0.101:80              Route   1      0          0
  -> 10.240.0.102:80              Route   1      0          0

The environment is ready. From a client machine we access the VIP 10.240.0.201 while capturing packets on the Director:

[root@lb1 ~]# tcpdump -ieth1 -nn -e tcp port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
11:50:01.024615 08:00:27:2d:af:18 > 08:00:27:48:90:6c, ethertype IPv4 (0x0800), length 74: 10.240.0.10.38482 > 10.240.0.201.80: Flags [S], seq 1959573689, win 29200, options [mss 1460,sackOK,TS val 304318064 ecr 0,nop,wscale 6], length 0
11:50:01.024640 08:00:27:48:90:6c > 08:00:27:23:1b:95, ethertype IPv4 (0x0800), length 74: 10.240.0.10.38482 > 10.240.0.201.80: Flags [S], seq 1959573689, win 29200, options [mss 1460,sackOK,TS val 304318064 ecr 0,nop,wscale 6], length 0
11:50:01.026358 08:00:27:2d:af:18 > 08:00:27:48:90:6c, ethertype IPv4 (0x0800), length 66: 10.240.0.10.38482 > 10.240.0.201.80: Flags [.], ack 3346334225, win 457, options [nop,nop,TS val 304318066 ecr 304104626], length 0
11:50:01.026406 08:00:27:48:90:6c > 08:00:27:23:1b:95, ethertype IPv4 (0x0800), length 66: 10.240.0.10.38482 > 10.240.0.201.80: Flags [.], ack 1, win 457, options [nop,nop,TS val 304318066 ecr 304104626], length 0
11:50:01.027197 08:00:27:2d:af:18 > 08:00:27:48:90:6c, ethertype IPv4 (0x0800), length 142: 10.240.0.10.38482 > 10.240.0.201.80: Flags [P.], seq 0:76, ack 1, win 457, options [nop,nop,TS val 304318067 ecr 304104626], length 76: HTTP: GET / HTTP/1.1
11:50:01.027210 08:00:27:48:90:6c > 08:00:27:23:1b:95, ethertype IPv4 (0x0800), length 142: 10.240.0.10.38482 > 10.240.0.201.80: Flags [P.], seq 0:76, ack 1, win 457, options [nop,nop,TS val 304318067 ecr 304104626], length 76: HTTP: GET / HTTP/1.1
11:50:01.032443 08:00:27:2d:af:18 > 08:00:27:48:90:6c, ethertype IPv4 (0x0800), length 66: 10.240.0.10.38482 > 10.240.0.201.80: Flags [.], ack 327, win 473, options [nop,nop,TS val 304318070 ecr 304104630], length 0
11:50:01.032468 08:00:27:48:90:6c > 08:00:27:23:1b:95, ethertype IPv4 (0x0800), length 66: 10.240.0.10.38482 > 10.240.0.201.80: Flags [.], ack 327, win 473, options [nop,nop,TS val 304318070 ecr 304104630], length 0
11:50:01.036452 08:00:27:2d:af:18 > 08:00:27:48:90:6c, ethertype IPv4 (0x0800), length 66: 10.240.0.10.38482 > 10.240.0.201.80: Flags [F.], seq 76, ack 327, win 473, options [nop,nop,TS val 304318072 ecr 304104630], length 0
11:50:01.037159 08:00:27:48:90:6c > 08:00:27:23:1b:95, ethertype IPv4 (0x0800), length 66: 10.240.0.10.38482 > 10.240.0.201.80: Flags [F.], seq 76, ack 327, win 473, options [nop,nop,TS val 304318072 ecr 304104630], length 0
11:50:01.047556 08:00:27:2d:af:18 > 08:00:27:48:90:6c, ethertype IPv4 (0x0800), length 66: 10.240.0.10.38482 > 10.240.0.201.80: Flags [.], ack 328, win 473, options [nop,nop,TS val 304318087 ecr 304104647], length 0
11:50:01.047583 08:00:27:48:90:6c > 08:00:27:23:1b:95, ethertype IPv4 (0x0800), length 66: 10.240.0.10.38482 > 10.240.0.201.80: Flags [.], ack 328, win 473, options [nop,nop,TS val 304318087 ecr 304104647], length 0

The packets' destination MAC address has been rewritten to the MAC address of eth1 on node2, and the response packets do not pass through the Director:

[root@node2 ~]# ip link show dev eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
link/ether 08:00:27:23:1b:95 brd ff:ff:ff:ff:ff:ff

This article is only a quick feasibility experiment. Using this in production requires additional design work, for example:

  • LVS itself can be paired with keepalived in an active/standby setup to provide HA for the Director (a minimal configuration sketch follows this list)
  • Use OSPF/ECMP to build an active-active Director cluster (see the earlier article <<基于Cumulus VX实验ECMP+OSPF负载均衡>>)
  • Drop the LVS Director layer entirely and use OSPF/ECMP to distribute traffic directly to the VIP on the worker nodes
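
For the keepalived option above, here is a minimal configuration sketch. It is an assumption based on common keepalived usage rather than part of the original experiment: a VRRP instance holds the VIP on the active Director, and the virtual_server block recreates the same DR-mode service as the earlier ipvsadm commands. The interface name, virtual_router_id and priorities are placeholders to adapt.

! /etc/keepalived/keepalived.conf (sketch)
vrrp_instance VI_1 {
    state MASTER                 ! BACKUP on the standby Director
    interface eth1
    virtual_router_id 51
    priority 100                 ! use a lower priority on the standby
    advert_int 1
    virtual_ipaddress {
        10.240.0.201/32
    }
}

virtual_server 10.240.0.201 80 {
    delay_loop 6
    lb_algo rr                   ! same scheduler as "ipvsadm -s rr"
    lb_kind DR                   ! same forwarding mode as "ipvsadm -g"
    protocol TCP

    real_server 10.240.0.101 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 3
        }
    }
    real_server 10.240.0.102 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 3
        }
    }
}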

In addition, according to an earlier article on the web, the worker nodes can skip configuring the VIP, because the VIP does not need to be received by any user-space program; the packet translation is performed directly by iptables.

In most scenarios this is correct. However, when the service is accessed through the VIP directly from a worker node and LVS happens to forward the packet back to that same node as the RealServer, the packet's source IP is one of the node's own addresses, so the following rule (which requires ! --src-type LOCAL) cannot match:

-A KUBE-SERVICES -d 10.240.0.201/32 -p tcp -m comment --comment "default/whoami:web external IP" -m tcp --dport 80 -m physdev ! --physdev-is-in -m addrtype ! --src-type LOCAL -j KUBE-SVC-225DYIB7Z2N6SCOU

And because the VIP is not configured on the node, the -m addrtype --dst-type LOCAL match in the rule below fails as well:

-A KUBE-SERVICES -d 10.240.0.201/32 -p tcp -m comment --comment "default/whoami:web external IP" -m tcp --dport 80 -m addrtype --dst-type LOCAL -j KUBE-SVC-225DYIB7Z2N6SCOU

As a result, the packet is ultimately dropped.
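
One way to observe this (a sketch, not part of the original article) is to curl the VIP from a worker node that does not have the VIP on lo and watch the packet counters of the two KUBE-SVC jump rules above; whenever LVS schedules the connection back to the same node, the request hangs and neither of those counters increases:

# On a worker node without the VIP configured locally:
curl --connect-timeout 3 http://10.240.0.201   # hangs whenever LVS picks this node as the RealServer
# Inspect the per-rule packet counters for the external-IP rules:
iptables -t nat -L KUBE-SERVICES -v -n | grep 10.240.0.201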

