1

Kubernetes集群部署Node Feature Discovery组件用于检测集群节点特性 - 人艰不拆_zmc

 6 months ago
source link: https://www.cnblogs.com/zhangmingcheng/p/18072751
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Node Feature Discovery(NFD)是由Intel创建的项目,能够帮助Kubernetes集群更智能地管理节点资源。它通过检测每个节点的特性能力(例如CPU型号、GPU型号、内存大小等)并将这些能力以标签的形式发送到Kubernetes集群的API服务器(kube-apiserver)。然后,通过kube-apiserver修改节点的标签。这些标签可以帮助调度器(kube-scheduler)更智能地选择最适合特定工作负载的节点来运行Pod。

Github:https://github.com/kubernetes-sigs/node-feature-discovery
Docs:https://kubernetes-sigs.github.io/node-feature-discovery/master/get-started/index.html

2、组件架构

NFD 细分为 NFD-Master 和 NFD-Worker 两个组件:

NFD-Master:是一个负责与 kubernetes API Server 通信的Deployment Pod,它从 NFD-Worker 接收节点特性并相应地修改 Node 资源对象(标签、注解)。

NFD-Worker:是一个负责对 Node 的特性能力进行检测的 Daemon Pod,然后它将信息传递给 NFD-Master,NFD-Worker 应该在每个 Node 上运行。

可以检测发现的硬件特征源(feature sources)清单包括:

  • IOMMU
  • Kernel
  • Memory
  • Network
  • Storage
  • System
  • Custom (rule-based custom features)
  • Local (hooks for user-specific features)

 3、组件安装

(1)安装前查看集群节点状态

[root@master-10 ~]# kubectl get nodes
NAME                  STATUS   ROLES                         AGE   VERSION
master-10.20.31.105   Ready    control-plane,master,worker   31h   v1.21.5

节点详细信息,主要关注标签、注解。

[root@master-10 ~]# kubectl describe nodes master-10.20.31.105
Name:               master-10.20.31.105
Roles:              control-plane,master,worker
Labels:             beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=master-10.20.31.105
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node-role.kubernetes.io/master=
node-role.kubernetes.io/worker=
node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"c6:fb:4b:8a:bb:12"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 10.20.31.105
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 12 Mar 2024 21:01:31 -0400
Taints:             <none>
........

 (2)组件安装

[root@master-10 opt]# kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.14.2
namespace/node-feature-discovery created
customresourcedefinition.apiextensions.k8s.io/nodefeaturerules.nfd.k8s-sigs.io created
customresourcedefinition.apiextensions.k8s.io/nodefeatures.nfd.k8s-sigs.io created
serviceaccount/nfd-master created
serviceaccount/nfd-worker created
role.rbac.authorization.k8s.io/nfd-worker created
clusterrole.rbac.authorization.k8s.io/nfd-master created
rolebinding.rbac.authorization.k8s.io/nfd-worker created
clusterrolebinding.rbac.authorization.k8s.io/nfd-master created
configmap/nfd-master-conf created
configmap/nfd-worker-conf created
service/nfd-master created
deployment.apps/nfd-master created
daemonset.apps/nfd-worker created

(3)查看组件状态

[root@master-10 opt]# kubectl get pods -n=node-feature-discovery
NAME                          READY   STATUS    RESTARTS   AGE
nfd-master-5c4684f5cb-hvjjb   1/1     Running   0          4m11s
nfd-worker-cpwx6              1/1     Running   0          4m11s

(4)查看组件日志

可以看到nfd-worker组件默认每隔一分钟检测一次节点特性。

[root@master-10 ~]# kubectl logs -f -n=node-feature-discovery nfd-worker-rlf5t
I0314 06:30:32.003264       1 main.go:66] "-server is deprecated, will be removed in a future release along with the deprecated gRPC API"
I0314 06:30:32.003372       1 nfd-worker.go:219] "Node Feature Discovery Worker" version="v0.14.2" nodeName="master-10.20.31.105" namespace="node-feature-discovery"
I0314 06:30:32.003589       1 nfd-worker.go:520] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-worker.conf"
I0314 06:30:32.004500       1 nfd-worker.go:552] "configuration successfully updated" configuration={"Core":{"Klog":{},"LabelWhiteList":{},"NoPublish":false,"FeatureSources":["all"],"Sources":null,"LabelSources":["all"],"SleepInterval":{"Duration":60000000000}},"Sources":{"cpu":{"cpuid":{"attributeBlacklist":["BMI1","BMI2","CLMUL","CMOV","CX16","ERMS","F16C","HTT","LZCNT","MMX","MMXEXT","NX","POPCNT","RDRAND","RDSEED","RDTSCP","SGX","SGXLC","SSE","SSE2","SSE3","SSE4","SSE42","SSSE3","TDX_GUEST"]}},"custom":[],"fake":{"labels":{"fakefeature1":"true","fakefeature2":"true","fakefeature3":"true"},"flagFeatures":["flag_1","flag_2","flag_3"],"attributeFeatures":{"attr_1":"true","attr_2":"false","attr_3":"10"},"instanceFeatures":[{"attr_1":"true","attr_2":"false","attr_3":"10","attr_4":"foobar","name":"instance_1"},{"attr_1":"true","attr_2":"true","attr_3":"100","name":"instance_2"},{"name":"instance_3"}]},"kernel":{"KconfigFile":"","configOpts":["NO_HZ","NO_HZ_IDLE","NO_HZ_FULL","PREEMPT"]},"local":{},"pci":{"deviceClassWhitelist":["03","0b40","12"],"deviceLabelFields":["class","vendor"]},"usb":{"deviceClassWhitelist":["0e","ef","fe","ff"],"deviceLabelFields":["class","vendor","device"]}}}
I0314 06:30:32.004796       1 metrics.go:70] "metrics server starting" port=8081
I0314 06:30:32.019135       1 nfd-worker.go:562] "starting feature discovery..."
I0314 06:30:32.019364       1 nfd-worker.go:577] "feature discovery completed"
I0314 06:31:32.021520       1 nfd-worker.go:562] "starting feature discovery..."
I0314 06:31:32.021695       1 nfd-worker.go:577] "feature discovery completed"
I0314 06:32:32.027970       1 nfd-worker.go:562] "starting feature discovery..."
I0314 06:32:32.028141       1 nfd-worker.go:577] "feature discovery completed"

可以看到nfd-master组件启动后默认第一分钟相应地修改 Node 资源对象(标签、注解),之后是每隔一个小时修改一次 Node 资源对象(标签、注解),也就是说如果一个小时以内用户手动误修改node资源特性信息(标签、注解),最多需要一个小时nfd-master组件才自动更正node资源特性信息。

[root@master-10 ~]# kubectl logs -n=node-feature-discovery nfd-master-5c4684f5cb-hvjjb
I0314 06:23:08.190218       1 nfd-master.go:213] "Node Feature Discovery Master" version="v0.14.2" nodeName="master-10.20.31.105" namespace="node-feature-discovery"
I0314 06:23:08.190356       1 nfd-master.go:1214] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-master.conf"
I0314 06:23:08.190912       1 nfd-master.go:1274] "configuration successfully updated" configuration=<
DenyLabelNs: {}
EnableTaints: false
ExtraLabelNs: {}
Klog: {}
LabelWhiteList: {}
LeaderElection:
LeaseDuration:
Duration: 15000000000
RenewDeadline:
Duration: 10000000000
RetryPeriod:
Duration: 2000000000
NfdApiParallelism: 10
NoPublish: false
ResourceLabels: {}
ResyncPeriod:
Duration: 3600000000000
>
I0314 06:23:08.190928       1 nfd-master.go:1338] "starting the nfd api controller"
I0314 06:23:08.191105       1 node-updater-pool.go:79] "starting the NFD master node updater pool" parallelism=10
I0314 06:23:08.860810       1 metrics.go:115] "metrics server starting" port=8081
I0314 06:23:08.861033       1 component.go:36] [core][Server #1] Server created
I0314 06:23:08.861050       1 nfd-master.go:347] "gRPC server serving" port=8080
I0314 06:23:08.861084       1 component.go:36] [core][Server #1 ListenSocket #2] ListenSocket created
I0314 06:23:09.860886       1 nfd-master.go:694] "will process all nodes in the cluster"
I0314 06:23:09.923362       1 nfd-master.go:1086] "node updated" nodeName="master-10.20.31.105"
I0314 07:23:09.224254       1 nfd-master.go:1086] "node updated" nodeName="master-10.20.31.105"
I0314 08:23:09.081362       1 nfd-master.go:1086] "node updated" nodeName="master-10.20.31.105"

(5)查看节点特性信息

可以看到NFD组件已经把节点特性信息维护到了节点标签、注解上,其中标签前缀默认为 feature.node.kubernetes.io/。

[root@master-10 opt]# kubectl describe node master-10.20.31.105
Name:               master-10.20.31.105
Roles:              control-plane,master,worker
Labels:             beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.FXSR=true
feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
feature.node.kubernetes.io/cpu-cpuid.HLE=true
feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
feature.node.kubernetes.io/cpu-cpuid.LAHF=true
feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
feature.node.kubernetes.io/cpu-cpuid.MPX=true
feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
feature.node.kubernetes.io/cpu-cpuid.RTM=true
feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
feature.node.kubernetes.io/cpu-cpuid.X87=true
feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
feature.node.kubernetes.io/cpu-hardware_multithreading=false
feature.node.kubernetes.io/cpu-model.family=6
feature.node.kubernetes.io/cpu-model.id=85
feature.node.kubernetes.io/cpu-model.vendor_id=Intel
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_FULL=true
feature.node.kubernetes.io/kernel-version.full=3.10.0-1160.105.1.el7.x86_64
feature.node.kubernetes.io/kernel-version.major=3
feature.node.kubernetes.io/kernel-version.minor=10
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-0300_15ad.present=true
feature.node.kubernetes.io/system-os_release.ID=centos
feature.node.kubernetes.io/system-os_release.VERSION_ID=7
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=7
kubernetes.io/arch=amd64
kubernetes.io/hostname=master-10.20.31.105
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node-role.kubernetes.io/master=
node-role.kubernetes.io/worker=
node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"c6:fb:4b:8a:bb:12"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 10.20.31.105
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
nfd.node.kubernetes.io/feature-labels:
cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-...
nfd.node.kubernetes.io/master.version: v0.14.2
nfd.node.kubernetes.io/worker.version: v0.14.2
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 12 Mar 2024 21:01:31 -0400

4、组件应用场景

Node Feature Discovery(NFD)组件的主要应用场景是在Kubernetes集群中提供更智能的节点调度。以下是一些NFD的常见应用场景:

  1. 智能节点调度:NFD可以帮助Kubernetes调度器更好地了解节点的特性和资源,从而更智能地选择最适合运行特定工作负载的节点。例如,如果某个Pod需要较强的GPU支持,调度器可以利用NFD标签来选择具有适当GPU型号的节点。

  2. 资源约束和优化:通过将节点的特性能力以标签的形式暴露给Kubernetes调度器,集群管理员可以更好地理解和利用集群中节点的资源情况,从而更好地进行资源约束和优化。

  3. 硬件感知的工作负载调度:对于特定的工作负载,可能需要特定类型或配置的硬件。NFD可以使调度器能够更加智能地选择具有适当硬件特性的节点来运行这些工作负载。

  4. 集群扩展性和性能:通过更智能地分配工作负载到节点,NFD可以提高集群的整体性能和效率。它可以帮助避免资源浪费,并确保工作负载能够充分利用可用的硬件资源。

  5. 集群自动化:NFD可以集成到自动化流程中,例如自动化部署或缩放工作负载。通过使用NFD,自动化系统可以更好地了解节点的特性和资源,从而更好地执行相应的操作。

总的来说,Node Feature Discovery(NFD)可以帮助提高Kubernetes集群的智能程度,使其能够更好地适应各种类型的工作负载和节点特性,从而提高集群的性能、可靠性和效率。

 
624219-20240314162752542-1472069471.png

如果您的 Kubernetes 集群需要根据节点的硬件特性进行智能调度或者对节点的硬件资源进行感知和利用,那么安装 Node Feature Discovery(NFD)是有必要的。然而,如果您的集群中的节点都具有相似的硬件配置,且不需要考虑硬件资源的差异,那么不需要安装 NFD。


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK