
Cluster node shutdown makes DNS intermittently unavailable until the pods are evicted

source link: https://zhangguanzhang.github.io/2021/02/02/node-shutdown-dns-unavailable/

These past few days we have been running disaster-recovery tests for a new internal project; all of the workloads run on K8S. Part of the test is simply picking a node at random and running shutdown -h now. After a node was powered off, a colleague noticed errors on the web pages, and the root cause turned out to be that in-cluster DNS resolution would intermittently fail.

Going by how SVC traffic works: after the node is powered off, its kubelet no longer updates its own status, so the node and its pods still look normal when you get them from the apiserver. Only after kube-controller-manager's --node-monitor-grace-period has elapsed, plus another --pod-eviction-timeout, does it start evicting the pods. That is roughly the flow.
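The timings involved can be checked on the control plane; the values in the comments below are the usual 1.15-era defaults, so verify them against your own cluster:

# A sketch: check which timeouts kube-controller-manager is actually running with
# (adjust for however kube-controller-manager is deployed in your cluster).
ps -ef | grep [k]ube-controller-manager | tr ' ' '\n' \
    | grep -E 'node-monitor-grace-period|pod-eviction-timeout'
# Typical defaults:
#   kubelet                  --node-status-update-frequency=10s
#   kube-controller-manager  --node-monitor-period=5s
#                            --node-monitor-grace-period=40s   # node goes NotReady after ~40s
#                            --pod-eviction-timeout=5m0s       # its pods are evicted ~5m after that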

Before pod eviction happens, which by default takes roughly 5 minutes, all of the Pod IPs on that node are still present in the SVC endpoints. The node my colleague shut down happened to be running a coredns replica, so for those ~5 minutes roughly one out of every N lookups would fail, N being the number of coredns replicas.
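During that window the stale state is easy to observe; an illustrative check, not from the original post:

# The dead node's coredns pod still shows as Running, and its Pod IP stays in the
# kube-dns endpoints until eviction removes it:
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system get endpoints kube-dns -o wide
kubectl get nodes    # the node only flips to NotReady after the grace period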

This is actually independent of the K8S version, because SVC and eviction behave this way everywhere. I did tune every parameter related to the node reporting its own status, down to the point where pods are evicted within about 20 seconds, but lookups could still fail inside those 20 seconds. I also asked friends and the community groups, and it seems nobody had ever tested shutting nodes down like this, presumably because everyone is on public cloud nowadays...
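For reference, an illustrative parameter combination for fast eviction might look like the following; these are not the exact values used here, and shortening them increases control-plane load and the risk of spurious evictions:

# kubelet:
#   --node-status-update-frequency=4s
# kube-controller-manager:
#   --node-monitor-period=2s
#   --node-monitor-grace-period=16s
#   --pod-eviction-timeout=10s
kubectl get nodes -w                                 # after a shutdown, watch how fast the node flips to NotReady
kubectl -n kube-system get endpoints kube-dns -w     # ...and how fast the dead endpoint disappears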

$ kubectl version -o json
{
  "clientVersion": {
    "major": "1",
    "minor": "15",
    "gitVersion": "v1.15.5",
    "gitCommit": "20c265fef0741dd71a66480e35bd69f18351daea",
    "gitTreeState": "clean",
    "buildDate": "2019-10-15T19:16:51Z",
    "goVersion": "go1.12.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "serverVersion": {
    "major": "1",
    "minor": "15",
    "gitVersion": "v1.15.5",
    "gitCommit": "20c265fef0741dd71a66480e35bd69f18351daea",
    "gitTreeState": "clean",
    "buildDate": "2019-10-15T19:07:57Z",
    "goVersion": "go1.12.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

Can local-dns really fix this?

The obvious first choice is the local-dns solution. Search around and you will find plenty of write-ups. In short, a node-cache process runs on every node with hostNetwork as a proxy, and it uses a dummy interface plus NAT to intercept DNS requests destined for the kube-dns SVC IP and cache them.
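On a node running node-cache, the dummy interface can be inspected directly; the interface name nodelocaldns and the addresses in the comment match the defaults and the IPs used later in this post, but may differ in other setups:

ip addr show nodelocaldns    # should hold both 169.254.20.10 and the kube-dns SVC IP 172.26.0.2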

In the official yaml file, __PILLAR__LOCAL__DNS__ and __PILLAR__DNS__SERVER__ have to be replaced with the dummy interface IP and the kube-dns SVC IP, and __PILLAR__DNS__DOMAIN__ is replaced according to the docs. The remaining variables are substituted at startup; you can see them in the logs once it is running.
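The substitution can be scripted roughly like this; a sketch that assumes the manifest is saved as nodelocaldns.yaml (hypothetical file name) and uses the values from this cluster:

kubedns=$(kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}')   # 172.26.0.2 here
localdns=169.254.20.10
domain=cluster1.local
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml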

Then I actually tested it and there was still a problem. Walking through the flow again, the yaml file contains this SVC and these node-cache startup arguments:

apiVersion: v1
kind: Service
metadata:
  name: kube-dns-upstream
  namespace: kube-system
  ...
spec:
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 53
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  selector:
    k8s-app: kube-dns
...
args: [ ..., "-upstreamsvc", "kube-dns-upstream" ]

In the startup logs you can see the rendered config file:

cluster1.local:53 {
    errors
    reload
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
in-addr.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
}
ip6.arpa:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . 172.26.189.136 {
        force_tcp
    }
    prometheus :9253
}
.:53 {
    errors
    cache 30
    reload
    loop
    bind 169.254.20.10 172.26.0.2
    forward . /etc/resolv.conf
    prometheus :9253
}

Because it has to NAT-hook requests going to the kube-dns SVC IP (172.26.0.2) while still needing to reach kube-dns itself, the yaml file creates an extra SVC with the same selector as kube-dns, and its name is passed in via the startup argument. You can see that it forwards to that SVC's IP. And because enableServiceLinks is enabled by default, the pod gets environment variables like this:

$ docker exec dfa env | grep KUBE_DNS_UPSTREAM_SERVICE_HOST
KUBE_DNS_UPSTREAM_SERVICE_HOST=172.26.189.136

In the code you can see that it simply converts the - in the argument to _, reads that environment variable, and renders it into the config file; that is how it obtains the SVC IP.

func toSvcEnv(svcName string) string {
    envName := strings.Replace(svcName, "-", "_", -1)
    return "$" + strings.ToUpper(envName) + "_SERVICE_HOST"
}
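As a quick illustration (not part of node-cache itself), the same transformation can be reproduced in the shell:

svc=kube-dns-upstream
echo "\$$(echo "$svc" | tr 'a-z-' 'A-Z_')_SERVICE_HOST"    # prints $KUBE_DNS_UPSTREAM_SERVICE_HOST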

With the default config, the cluster1.local:53 zone still forwards to a SVC, so the problem is still there.

So the only real fix is to bypass the SVC entirely. I changed coredns to listen on port 153 with hostNetwork: true and pinned it to the three masters with a nodeSelector. The config file then becomes:

cluster1.local:53 {
    errors
    reload
    bind 169.254.20.10 172.26.0.2
    forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
        force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
...
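With that in place, the hostNetwork CoreDNS instances can be queried directly on port 153, bypassing both node-cache and the Service (IPs and domain as in the config above):

dig @10.11.86.107 -p 153 kubernetes.default.svc.cluster1.local +short
dig @10.11.86.108 -p 153 kubernetes.default.svc.cluster1.local +short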

Testing after that still showed occasional failures. I had previously seen 米开朗基杨 share dnsredir, a CoreDNS plugin with failover support, so I tried adding the plugin and recompiling.

After reading the docs and building it, the resulting binary could not parse the config file. The reason is that node-cache is not built by adding its own plugin on top of CoreDNS; instead, its own codebase imports CoreDNS's built-in plugins.

The details are in this issue: include coredns plugin at node-cache don't work expect

The bind plugin in the official node-cache is what handles the dummy interface and the iptables NAT part. That feature appeals to me, so I decided to keep trying to get this configured properly.
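The NAT side can also be inspected on any node running node-cache; the exact rules vary by version, but you should see NOTRACK/ACCEPT entries for the two listen IPs on port 53 (illustrative):

iptables-save -t raw    | grep -E '169\.254\.20\.10|172\.26\.0\.2'
iptables-save -t filter | grep -E '169\.254\.20\.10|172\.26\.0\.2'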

While I was testing the dnsredir plugin, 米开朗基杨 asked me to try a minimal config section to rule out interference, so I switched back and forth between these two configs:

  Corefile: |
    cluster1.local:53 {
        errors
        reload
        dnsredir . {
            to 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153
            max_fails 1
            health_check 1s
            spray
        }
        #forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
        #    max_fails 1
        #    policy round_robin
        #    health_check 0.4s
        #}
        prometheus :9253
        health 169.254.20.10:8080
    }
#----------
  Corefile: |
    cluster1.local:53 {
        errors
        reload
        #dnsredir . {
        #    to 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153
        #    max_fails 1
        #    health_check 1s
        #    spray
        #}
        forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
            max_fails 1
            policy round_robin
            health_check 0.4s
        }
        prometheus :9253
        health 169.254.20.10:8080
    }

And then, surprisingly, requests stopped failing altogether:

$ function d(){ while :;do sleep 0.2; date;dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short; done; }
$ d
2021年 02月 02日 星期二 12:54:43 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:44 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:44 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:44 CST <--- a master was shut down at this point
172.26.158.130
2021年 02月 02日 星期二 12:54:45 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:47 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:48 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:48 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:48 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:51 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:51 CST
172.26.158.130
2021年 02月 02日 星期二 12:54:52 CST
172.26.158.130

At that point I gave up on fiddling with the dnsredir plugin any further and asked a colleague to test; it looked fine. He then asked me to apply the change to another environment and test again, and there the failures still occurred:

$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
; <<>> DiG 9.10.3-P4-Ubuntu <<>> @172.26.0.2 account-gateway +short
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short

After repeatedly testing minimal zone configs and comparing them, I traced it to the reverse-lookup zones: with reverse lookup disabled there is no problem at all. Comment out the following:

#in-addr.arpa:53 {
#    errors
#    cache 30
#    reload
#    loop
#    bind 169.254.20.10 172.26.0.2
#    forward . __PILLAR__CLUSTER__DNS__ {
#        force_tcp
#    }
#    prometheus :9253
#}
#ip6.arpa:53 {
#    errors
#    cache 30
#    reload
#    loop
#    bind 169.254.20.10 172.26.0.2
#    forward . __PILLAR__CLUSTER__DNS__ {
#        force_tcp
#    }
#    prometheus :9253
#}
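Note that the zones above forward PTR queries to __PILLAR__CLUSTER__DNS__, i.e. still through the Service path. If you want to exercise reverse lookup explicitly while testing, a query like this hits those zones when they exist (illustrative):

dig @172.26.0.2 -x 172.26.158.130 +short    # PTR lookup, served by the in-addr.arpa zone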

While testing resolution, shutting down any node running coredns no longer causes any problem:

$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124
$ dig @172.26.0.2 account-gateway.default.svc.cluster1.local +short
172.26.158.124

The yaml files, roughly

apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: v1
kind: Service
metadata:
  name: kube-dns-upstream
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "KubeDNSUpstream"
spec:
  clusterIP: 172.26.0.3 # <---- just pin it; for testing you can hit this IP directly and bypass node-cache
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: 153
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 153
  selector:
    k8s-app: kube-dns
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
data:
  Corefile: |
    cluster1.local:53 {
        errors
        cache {
            success 9984 30
            denial 9984 5
        }
        reload
        loop
        bind 169.254.20.10 172.26.0.2
        forward . 10.11.86.107:153 10.11.86.108:153 10.11.86.109:153 {
            force_tcp
            max_fails 1
            policy round_robin
            health_check 0.5s
        }
        prometheus :9253
        health 169.254.20.10:8070
    }
    #in-addr.arpa:53 {
    #    errors
    #    cache 30
    #    reload
    #    loop
    #    bind 169.254.20.10 172.26.0.2
    #    forward . __PILLAR__CLUSTER__DNS__ {
    #        force_tcp
    #    }
    #    prometheus :9253
    #}
    #ip6.arpa:53 {
    #    errors
    #    cache 30
    #    reload
    #    loop
    #    bind 169.254.20.10 172.26.0.2
    #    forward . __PILLAR__CLUSTER__DNS__ {
    #        force_tcp
    #    }
    #    prometheus :9253
    #}
    .:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.20.10 172.26.0.2
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
  namespace: kube-system
  labels:
    k8s-app: node-local-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 10%
  selector:
    matchLabels:
      k8s-app: node-local-dns
  template:
    metadata:
      labels:
        k8s-app: node-local-dns
      annotations:
        prometheus.io/port: "9253"
        prometheus.io/scrape: "true"
    spec:
      imagePullSecrets:
      - name: regcred
      priorityClassName: system-node-critical
      serviceAccountName: node-local-dns
      hostNetwork: true
      dnsPolicy: Default # Don't use cluster DNS.
      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"
      containers:
      - name: node-cache
        image: xxx.lan:5000/k8s-dns-node-cache:1.16.0
        resources:
          requests:
            cpu: 25m
            memory: 10Mi
        args: [ "-localip", "169.254.20.10,172.26.0.2", "-conf", "/etc/Corefile", "-upstreamsvc", "kube-dns-upstream", "-health-port","8070" ]
        securityContext:
          privileged: true
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9253
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            host: 169.254.20.10
            path: /health
            port: 8070
          initialDelaySeconds: 40
          timeoutSeconds: 3
        volumeMounts:
        - mountPath: /run/xtables.lock
          name: xtables-lock
          readOnly: false
        - name: config-volume
          mountPath: /etc/coredns
        - name: kube-dns-config
          mountPath: /etc/kube-dns
      volumes:
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
      - name: kube-dns-config
        configMap:
          name: kube-dns
          optional: true
      - name: config-volume
        configMap:
          name: node-local-dns
          items:
          - key: Corefile
            path: Corefile.base
---
# A headless service is a service with a service IP but instead of load-balancing it will return the IPs of our associated Pods.
# We use this to expose metrics to Prometheus.
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "9253"
    prometheus.io/scrape: "true"
  labels:
    k8s-app: node-local-dns
  name: node-local-dns
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - name: metrics
    port: 9253
    targetPort: 9253
  selector:
    k8s-app: node-local-dns
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: coredns
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
    addonmanager.kubernetes.io/mode: Reconcile
  name: system:coredns
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  - pods
  - namespaces
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  labels:
    kubernetes.io/bootstrapping: rbac-defaults
    addonmanager.kubernetes.io/mode: EnsureExists
  name: system:coredns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:coredns
subjects:
- kind: ServiceAccount
  name: coredns
  namespace: kube-system
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  Corefile: |
    .:153 {
        errors
        health :8180
        kubernetes cluster1.local. in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "CoreDNS"
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
      annotations:
        seccomp.security.alpha.kubernetes.io/pod: 'docker/default'
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - kube-dns
              topologyKey: kubernetes.io/hostname
      hostNetwork: true
      priorityClassName: system-cluster-critical
      serviceAccountName: coredns
      nodeSelector:
        node-role.kubernetes.io/master: "true"
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      imagePullSecrets:
      - name: regcred
      containers:
      - name: coredns
        image: xxxx.lan:5000/coredns:1.7.1
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            memory: 270Mi
          requests:
            cpu: 100m
            memory: 150Mi
        args: [ "-conf", "/etc/coredns/Corefile" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
          readOnly: true
        ports:
        - containerPort: 153
          name: dns
          protocol: UDP
        - containerPort: 153
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: 8180
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
      dnsPolicy: Default
      volumes:
      - name: config-volume
        configMap:
          name: coredns
          items:
          - key: Corefile
            path: Corefile
---
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9153"
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "CoreDNS"
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 172.26.0.2
  ports:
  - name: dns
    port: 53
    targetPort: 153
    protocol: UDP
  - name: dns-tcp
    port: 53
    targetPort: 153
    protocol: TCP
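To roll this out and sanity-check it (a sketch: the file name node-local-dns.yaml is hypothetical, and the image registries, IPs and cluster domain above are specific to this environment):

kubectl apply -f node-local-dns.yaml
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide         # coredns pinned to the masters
kubectl -n kube-system get pods -l k8s-app=node-local-dns -o wide   # node-cache on every node
dig @172.26.0.2 kubernetes.default.svc.cluster1.local +short        # resolved via the local node-cache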
