Pod sandbox creation failure
source link: https://qingwave.github.io/pod-sandbox-recreated/
Mar 18, 2020 · cloud
Today, while updating a service on Kubernetes, a pod failed to start with the error failed to start sandbox container, as shown below:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 28m default-scheduler Successfully assigned kube-system/k8s-proxy-7wkt4 to tj1-staging-com-ocean007-201812.kscn
Warning FailedCreatePodSandBox 28m (x13 over 28m) kubelet, tj1-staging-com-ocean007-201812.kscn Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "k8s-proxy-7wkt4": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:297: getting the final child's pid from pipe caused \"EOF\"": unknown
Normal SandboxChanged 3m19s (x1364 over 28m) kubelet, tj1-staging-com-ocean007-201812.kscn Pod sandbox changed, it will be killed and re-created.
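The events above can be pulled with kubectl describe on the failing pod:
kubectl -n kube-system describe pod k8s-proxy-7wkt4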
The sandbox creation failure is only a symptom; it is caused by some other problem on the host, typically CPU, disk I/O, or memory pressure.
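A reasonable first step is to log onto the node and check whether kubelet and docker themselves report anything abnormal. A minimal sketch of that log check, assuming both run as systemd services:
journalctl -u kubelet --since "1 hour ago" | grep -iE "error|fail"
journalctl -u docker --since "1 hour ago" | grep -iE "error|fail"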
Here the kubelet and docker logs show no obvious errors, but top shows dockerd consuming an extremely high amount of CPU:
[root@tj1-staging-com-ocean007-201812 ~]# top
top - 17:55:00 up 265 days, 3:41, 1 user, load average: 10.71, 11.34, 10.76
Tasks: 816 total, 5 running, 811 sleeping, 0 stopped, 0 zombie
%Cpu(s): 24.0 us, 34.5 sy, 0.0 ni, 41.4 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 65746380 total, 20407940 free, 11007040 used, 34331400 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 49134416 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
115483 root 20 0 3965212 273188 34564 S 489.7 0.4 382260:40 dockerd
1367523 root 20 0 18376 2972 2716 R 66.9 0.0 20163:45 bash
1367487 root 20 0 11856 5616 4512 S 54.0 0.0 16748:26 containerd-shim
3200169 root 20 0 1300 4 0 R 53.3 0.0 14913:49 sh
2429952 root 20 0 1300 4 0 S 49.3 0.0 9620:56 sh
3200130 root 20 0 9392 4756 3884 S 47.7 0.0 13417:30 containerd-shim
3718475 root 20 0 1300 4 0 R 47.4 0.0 8600:20 sh
3718440 root 20 0 10736 5516 4512 S 42.1 0.0 7575:31 containerd-shim
2429917 root 20 0 11856 5556 4512 S 40.1 0.0 8313:22 containerd-shim
3205493 root 20 0 3775924 230996 66704 S 18.9 0.4 2559:07 kubelet
1 root 20 0 195240 157000 3932 S 7.9 0.2 1417:46 systemd
804 dbus 20 0 30308 6460 2464 S 1.7 0.0 462:18.84 dbus-daemon
1011737 root 20 0 277656 122788 18428 S 1.3 0.2 768:03.00 cadvisor
115508 root 20 0 7139200 32896 24288 S 1.0 0.1 662:25.27 containerd
806 root 20 0 24572 3060 2480 S 0.7 0.0 171:22.52 systemd-logind
511080 root 0 -20 2751348 52552 15744 S 0.7 0.1 178:27.51 sagent
1102507 root 20 0 11792 7292 4512 S 0.7 0.0 23:36.37 containerd-shim
1272223 root 20 0 164800 5296 3824 R 0.7 0.0 0:00.38 top
2866292 root 20 0 5045000 1.983g 3080 S 0.7 3.2 230:09.47 redis
At the same time, system CPU time is abnormally high:
%Cpu(s): 24.0 us, 34.5 sy, 0.0 ni, 41.4 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
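To see which processes that system time is going to, pidstat (from the sysstat package) breaks per-process CPU usage into %usr and %system; a quick sketch:
pidstat -u 1 5   # sample once per second, five times; check the %system column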
Based on past experience, this is usually caused by certain containers, and indeed top shows a few sh processes with relatively high CPU usage.
ps reveals that the process is, of all things, an infinite loop:
[root@tj1-staging-com-ocean007-201812 ~]# ps -ef |grep 1367523
root 1287628 1247781 0 17:55 pts/1 00:00:00 grep --color=auto 1367523
root 1367523 1367504 72 Feb28 ? 14-00:04:17 /bin/bash -c while true; do echo hello; done
The container the process belongs to can be found through /proc/<pid>/cgroup:
# cat /proc/1367523/cgroup
11:freezer:/kubepods/besteffort/pod55d3adf2-67f7-11ea-93f2-246e968203b8/29842d5544b701dbb5ff647dba19bb4ebec821edc6ee1ffbd7aeee58fa5038fd
10:devices:/kubepods/besteffort/pod55d3adf2-67f7-11ea-93f2-246e968203b8/29842d5544b701dbb5ff647dba19bb4ebec821edc6ee1ffbd7aeee58fa5038fd
9:hugetlb:/kubepods/besteffort/pod55d3adf2-67f7-11ea-93f2-246e968203b8/29842d5544b701dbb5ff647dba19bb4ebec821edc6ee1ffbd7aeee58fa5038fd
8:blkio:/kubepods/besteffort/pod55d3adf2-67f7-11ea-93f2-246e968203b8/29842d5544b701dbb5ff647dba19bb4ebec821edc6ee1ffbd7aeee58fa5038fd
7:memory:/kubepods/besteffort/pod55d3adf2-67f7-11ea-93f2-246e968203b8/29842d5544b701dbb5ff647dba19bb4ebec821edc6ee1ffbd7aeee58fa5038fd
6:perf_event:/kubepods/besteffort/pod55d3adf2-67f7-11ea-93f2-246e968203b8/29842d5544b701dbb5ff647dba19bb4ebec821edc6ee1ffbd7aeee58fa5038fd
5:cpuset:/kubepods/besteffort/pod55d3adf2-67f7-11ea-93f2-246e968203b8/29842d5544b701dbb5ff647dba19bb4ebec821edc6ee1ffbd7aeee58fa5038fd
4:pids:/kubepods/besteffort/pod55d3adf2-67f7-11ea-93f2-246e968203b8/29842d5544b701dbb5ff647dba19bb4ebec821edc6ee1ffbd7aeee58fa5038fd
3:net_cls,net_prio:/kubepods/besteffort/pod55d3adf2-67f7-11ea-93f2-246e968203b8/29842d5544b701dbb5ff647dba19bb4ebec821edc6ee1ffbd7aeee58fa5038fd
2:cpu,cpuacct:/kubepods/besteffort/pod55d3adf2-67f7-11ea-93f2-246e968203b8/29842d5544b701dbb5ff647dba19bb4ebec821edc6ee1ffbd7aeee58fa5038fd
1:name=systemd:/kubepods/besteffort/pod55d3adf2-67f7-11ea-93f2-246e968203b8/29842d5544b701dbb5ff647dba19bb4ebec821edc6ee1ffbd7aeee58fa5038fd
The container ID in the cgroup path leads straight to the container:
docker ps | grep 29842d554
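To map the container back to its pod, the Kubernetes-managed labels on the container can be inspected (a sketch assuming the dockershim-era io.kubernetes.* labels; the namespace and pod name placeholders below are illustrative):
docker inspect --format '{{index .Config.Labels "io.kubernetes.pod.namespace"}}/{{index .Config.Labels "io.kubernetes.pod.name"}}' 29842d554
kubectl -n <namespace> delete pod <pod-name>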
After cleaning up the offending pods, the system returned to normal:
top - 18:25:57 up 265 days, 4:12, 1 user, load average: 1.05, 1.24, 4.02
Tasks: 769 total, 1 running, 768 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.7 us, 0.9 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65746380 total, 22106960 free, 10759860 used, 32879560 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 49401576 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3205493 root 20 0 3775924 229844 66704 S 9.9 0.3 2563:18 kubelet
115483 root 20 0 3965468 249124 34564 S 7.9 0.4 382323:36 dockerd
1 root 20 0 195240 157000 3932 S 6.3 0.2 1419:48 systemd
804 dbus 20 0 30308 6460 2464 S 2.0 0.0 462:51.51 dbus-daemon
3085322 root 20 0 12.045g 1.578g 19028 S 1.3 2.5 767:51.19 java
115508 root 20 0 7139200 32264 24288 S 1.0 0.0 662:42.18 containerd
511080 root 0 -20 2751348 42116 15744 S 1.0 0.1 178:44.79 sagent
1011737 root 20 0 277656 111836 18428 S 1.0 0.2 768:49.01 cadvisor
1523167 root 20 0 164800 5436 4012 R 0.7 0.0 0:00.04 top
3199459 root 20 0 1554708 43668 9496 S 0.7 0.1 28:50.60 falcon-agent
7 root 20 0 0 0 0 S 0.3 0.0 619:07.64 rcu_sched
806 root 20 0 24572 3060 2480 S 0.3 0.0 171:33.69 systemd-logind
11921 root 20 0 94820 20480 5840 S 0.3 0.0 1402:42 consul
575838 root 20 0 411464 17092 7364 S 0.3 0.0 15:16.25 python
856593 root 20 0 1562392 37912 9612 S 0.3 0.1 21:34.23 falcon-agent
931957 33 20 0 90728 3392 1976 S 0.3 0.0 0:51.23 nginx
1212186 root 20 0 0 0 0 S 0.3 0.0 0:01.12 kworker/14:1
1726228 root 20 0 9392 4496 3808 S 0.3 0.0 0:00.67 containerd-shim
1887128 root 20 0 273160 7932 3128 S 0.3 0.0 46:05.23 redis-server
2788111 root 20 0 273160 6300 3080 S 0.3 0.0 25:18.55 redis-server
3199297 root 20 0 1563160 44812 9624 S 0.3 0.1 31:13.73 falcon-agent
Sandbox creation failures have all kinds of causes, such as [an exception triggered by a bad memory setting][1] or [a dockerd anomaly][2].
In this case, the problem was that some test pods were started with while true; do echo hello; done. The infinite loop echoes hello endlessly, and the container runtime (dockerd / containerd-shim), which continuously reads the container's stdout to collect logs, ends up issuing a huge number of read() system calls, so CPU usage on the node spikes. With several such pods on the node, the system becomes too busy to serve other requests properly.
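One quick way to confirm the syscall storm (a sketch; strace adds noticeable overhead, so only attach briefly) is to count syscalls in dockerd or the busy containerd-shim for a few seconds:
strace -c -f -p $(pidof dockerd)   # let it run a few seconds, then Ctrl-C to print the per-syscall counts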
Problems like this are hard to detect directly at pod creation time; the only practical way to catch them promptly is to add node-level alerts (dockerd CPU usage, node cpu.sys usage).
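As a rough illustration, a node-side check of those two metrics could look like the sketch below (hypothetical thresholds; in practice this belongs in the existing monitoring agent, e.g. falcon-agent or Prometheus):
dockerd_cpu=$(top -bn1 | awk '$NF=="dockerd" {print $9; exit}')   # dockerd %CPU
sys_cpu=$(top -bn1 | awk '/%Cpu/ {print $4; exit}')               # overall sy %
echo "dockerd=${dockerd_cpu}% sys=${sys_cpu}%"                    # alert when either stays high, e.g. >300% / >30%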