
Fixing a hung docker pull on a Chinese domestic OS

source link: https://zhangguanzhang.github.io/2023/04/12/docker-pull-hang/#/%E5%90%8E%E7%BB%AD

 2023/04/12  431  Share

How we resolved a docker pull that hung on large images on a customer's arm64 machine today.

A colleague pulled me in to look at a customer site where a Docker image could not be pulled. The pull hangs indefinitely, like this:

$ docker pull xxx:5000/xxxx
xxx: Pulling from xxx/xxxxxx
7c0b344a74c2: Extracting [> ] 294.9kB/26.66MB
7c0b344a74c2: Download complete
e53ed7fd3110: Download complete
d2cae797bc79: Download complete
ec3ddc176f08: Download complete
2969517e196e: Download complete
097fa64722e8: Download complete
1dde4ca01a5a: Download complete

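We loaded the offline file with docker load -i, tagged it, pushed it to the registry, deleted the local image, and pulled again; it hung in the same way. Some small images pulled without any problem, so the mount options of the docker data-root cannot be the cause. The environment is as follows: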

$ docker info
...
Server Version: 19.03.15
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: ea765aba0d05254012b0b9e595e995c09186427f
runc version: v1.0.0-0-g84113eef
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.19.90-2211.5.0.0178.22.uel20.aarch64
Operating System: UnionTech OS Server 20
OSType: linux
Architecture: aarch64
CPUs: 24
Total Memory: 94.56GiB
Name: host-xxxx
ID: RTQS:5TXE:5T3S:YW7X:OHPK:FZ7D:7EHD:DH5Z:JNBV:FVXS:24FA:EIVS
Docker Root Dir: /data/kube/docker
Debug Mode: true
File Descriptors: 29
Goroutines: 46
System Time: 2023-04-12T16:10:25.33362426+08:00
$ uname -a
Linux host-x 4.19.90-2211.5.0.0178.22.uel20.aarch64 #1 SMP Thu Nov 24 10:33:07 CST 2022 aarch64 aarch64 aarch64 GNU/Linux
$ cat /etc/os-release
PRETTY_NAME="UnionTech OS Server 20"
NAME="UnionTech OS Server 20"
VERSION_ID="20"
VERSION="20"
ID=uos
HOME_URL="https://www.chinauos.com/"
BUG_REPORT_URL="https://bbs.chinauos.com/"
VERSION_CODENAME=fuyu
PLATFORM_ID="platform:uel20"

While the pull was hung, I opened another ssh session and saw in top that an unpigz process was using a lot of CPU. I used its PID to look up some details:

$ pstree -sp 1170083
systemd(1)───dockerd(1169795)───unpigz(1170083)─┬─{unpigz}(1170084)
                                                ├─{unpigz}(1170086)
                                                └─{unpigz}(1170087)

It was spawned by docker. strace only showed that it was blocked. After I killed unpigz, the hung pull failed with:

failed to register layer: Error processing tar file(exit status 1): unexpected EOF
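
The unexpected EOF text is what a tar reader reports when its input ends partway through an entry, which is what happens when the decompressor feeding it goes away. The sketch below is not docker's code. It is a minimal Go illustration using the standard archive/tar and compress/gzip packages, and buildLayer, untar, and extractWithCut are hypothetical helper names. It builds a small gzip-compressed tar in memory, then hands the tar reader either the whole decompressed stream or only its first 1024 bytes.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// buildLayer returns a gzip-compressed tar archive holding one file.
// It stands in for an image layer (hypothetical helper).
func buildLayer(name string, content []byte) ([]byte, error) {
	var buf bytes.Buffer
	gw := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gw)
	hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(content))}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if _, err := tw.Write(content); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// untar walks every entry of a decompressed tar stream and discards the bodies.
func untar(r io.Reader) error {
	tr := tar.NewReader(r)
	for {
		_, err := tr.Next()
		if err == io.EOF {
			return nil // clean end of archive
		}
		if err != nil {
			return err
		}
		if _, err := io.Copy(io.Discard, tr); err != nil {
			return err
		}
	}
}

// extractWithCut decompresses layer but lets the tar reader see only the
// first cut bytes, as if the decompressor had died mid-stream.
func extractWithCut(layer []byte, cut int64) error {
	gr, err := gzip.NewReader(bytes.NewReader(layer))
	if err != nil {
		return err
	}
	defer gr.Close()
	return untar(io.LimitReader(gr, cut))
}

func main() {
	layer, err := buildLayer("etc/hostname", make([]byte, 4096))
	if err != nil {
		panic(err)
	}
	fmt.Println("full stream:", extractWithCut(layer, 1<<20))
	fmt.Println("cut stream: ", extractWithCut(layer, 1024))
}
```

In the cut case the 512-byte header is complete but the 4096-byte body is not, so the reader returns io.ErrUnexpectedEOF, whose message is the same unexpected EOF seen in the pull error.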

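Each layer of a Docker image is really a tar archive; a pull downloads the tarballs and then extracts them. This looked like a problem on the extraction side, so I searched the docker source for Error processing tar file and found: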

// https://github.com/moby/moby/blob/v19.03.15/pkg/chrootarchive/archive_unix.go#L90-L116
cmd := reexec.Command("docker-untar", dest, root)
...
if err := cmd.Wait(); err != nil {
	// when `xz -d -c -q | docker-untar ...` failed on docker-untar side,
	// we need to exhaust `xz`'s output, otherwise the `xz` side will be
	// pending on write pipe forever
	io.Copy(ioutil.Discard, decompressedArchive)

	return fmt.Errorf("Error processing tar file(%v): %s", err, output)
}
return nil
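
The comment about exhausting the decompressor's output describes a general property of pipes: a writer blocks once nobody reads the other end. A minimal sketch of that behavior follows, assuming Go's in-memory io.Pipe as a stand-in for the OS pipe (io.Pipe has no buffer, whereas a kernel pipe blocks only after its buffer fills). The names produce and consumeThenDrain are hypothetical and do not come from the docker source.

```go
package main

import (
	"fmt"
	"io"
)

// produce plays the decompressor: it writes `chunks` blocks of 1 KiB into w
// and reports on done how many bytes it managed to write.
func produce(w *io.PipeWriter, chunks int, done chan<- int) {
	written := 0
	block := make([]byte, 1024)
	for i := 0; i < chunks; i++ {
		n, err := w.Write(block) // blocks until a reader takes the data
		written += n
		if err != nil {
			break
		}
	}
	w.Close()
	done <- written
}

// consumeThenDrain plays a consumer that stops after the first block,
// then drains the remainder so the producer can finish.
func consumeThenDrain(chunks int) (consumed, produced int) {
	pr, pw := io.Pipe()
	done := make(chan int, 1)
	go produce(pw, chunks, done)

	first := make([]byte, 1024)
	n, _ := io.ReadFull(pr, first)
	consumed = n

	// Without this drain the producer would stay blocked on its next Write,
	// and the receive from done below would never return.
	io.Copy(io.Discard, pr)

	return consumed, <-done
}

func main() {
	consumed, produced := consumeThenDrain(8)
	fmt.Printf("consumer read %d bytes, producer wrote %d bytes\n", consumed, produced)
}
```

Removing the io.Copy line turns this into a deadlock, which is the in-process analogue of a decompressor left pending on its write pipe.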

Note the xz -d -c -q | docker-untar ... in the comment. I checked the cmdline of unpigz, and there was indeed a hung docker-untar process:

$ xargs -0 < /proc/1170083/cmdline
/usr/bin/unpigz -d -c
$ ps aux | grep docker-unta[r]
root 1164788 0.0 0.0 1491008 39488 pts/2 Sl+ 15:21 0:00 docker-untar / /data/kube/docker/overlay2/546b7b992b53b243450807b8150c4a1905e93afae604da69a21bbaaf443f178e/diff

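So docker execs unpigz to decompress and pipes the output to docker-untar, which is registered through reexec. The unpigz process tree above shows that docker invokes unpigz by default rather than xz. A quick search showed that unpigz is a faster implementation than gzip for handling the gz format. Since docker execs unpigz, I searched the source for it: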

// https://github.com/moby/moby/blob/v19.03.15/pkg/archive/archive.go#L32-L39
func init() {
	if path, err := exec.LookPath("unpigz"); err != nil {
		logrus.Debug("unpigz binary not found in PATH, falling back to go gzip library")
	} else {
		logrus.Debugf("Using unpigz binary found at path %s", path)
		unpigzPath = path
	}
}
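
To check which path a given host would take, the same PATH probe can be run on its own. This is a small sketch rather than part of docker; decompressorFor is a hypothetical name, and it only assumes the standard os/exec package. It should be run with the same PATH that dockerd sees, which under systemd may differ from an interactive shell.

```go
package main

import (
	"fmt"
	"os/exec"
)

// decompressorFor reports what a LookPath-style probe would select:
// the external binary's path if it is on PATH, otherwise the Go library.
func decompressorFor(binary string) string {
	path, err := exec.LookPath(binary)
	if err != nil {
		return "go gzip library"
	}
	return path
}

func main() {
	fmt.Println("unpigz ->", decompressorFor("unpigz"))
}
```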

Scrolling further down, this is where unpigzPath is exec'd:

// https://github.com/moby/moby/blob/v19.03.15/pkg/archive/archive.go#L160-L174
func gzDecompress(ctx context.Context, buf io.Reader) (io.ReadCloser, error) {
	if unpigzPath == "" {
		return gzip.NewReader(buf)
	}

	disablePigzEnv := os.Getenv("MOBY_DISABLE_PIGZ")
	if disablePigzEnv != "" {
		if disablePigz, err := strconv.ParseBool(disablePigzEnv); err != nil {
			return nil, err
		} else if disablePigz {
			return gzip.NewReader(buf)
		}
	}

	return cmdStream(exec.CommandContext(ctx, unpigzPath, "-d", "-c"), buf)
}
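
The branch order in gzDecompress determines which values of MOBY_DISABLE_PIGZ actually switch to the Go gzip library. The sketch below isolates only that decision so it can be tried with different values. usePigz is a hypothetical helper, not a function in moby, and the path passed to it is just an example.

```go
package main

import (
	"fmt"
	"strconv"
)

// usePigz follows the decision order of the quoted gzDecompress:
// no binary means the Go library; otherwise the env value, when set,
// must parse as a bool, and a true value disables pigz.
func usePigz(unpigzPath, disableEnv string) (bool, error) {
	if unpigzPath == "" {
		return false, nil
	}
	if disableEnv != "" {
		disable, err := strconv.ParseBool(disableEnv)
		if err != nil {
			return false, err // an unparsable value is an error, not a default
		}
		if disable {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	for _, v := range []string{"", "true", "1", "false", "yes"} {
		use, err := usePigz("/usr/bin/unpigz", v)
		fmt.Printf("MOBY_DISABLE_PIGZ=%q -> usePigz=%v err=%v\n", v, use, err)
	}
}
```

strconv.ParseBool accepts only 1, t, T, TRUE, true, True, 0, f, F, FALSE, false, and False, so a value such as yes makes the quoted function return an error instead of falling back to gzip.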

The unpigz version on site:

$ rpm -qf /bin/unpigz
pigz-2.4-7.uel20.01.aarch64
$ rpm -V pigz
# -V prints nothing, so the package files have not been modified

Note the env variable in that function that disables pigz and uses gzip instead. Starting the docker daemon with this env set made the image pull work:

$ systemctl stop docker
# temporarily start it in the foreground in debug mode from the command line; the pull works this way
$ MOBY_DISABLE_PIGZ=true dockerd --debug

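UOS requires a license before yum can install or upgrade packages, and accessing the URL from the repo file returns 401. The plan is to have the customer contact the UOS vendor to upgrade the pigz package first, and to fall back to MOBY_DISABLE_PIGZ if that does not fix it.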

