Using PyTorch on ClearLinux docker image

Source: https://donghao.org/2022/06/22/using-pytorch-on-clearlinux-docker-image/

I have been using Nvidia's official PyTorch docker image for my model training for quite a long time. It works very well, but the only problem is that the image is too large: more than 6 GB. On my slow home network, downloading it takes a painfully long time.

Yesterday, an interesting idea came to mind: why not build my own small docker image for PyTorch? So I started to do it.

First, I chose Intel's ClearLinux, since it is very clean and uses state-of-the-art software (which means its performance is fabulous).

I used distrobox to create my environment:

distrobox create --image clearlinux:latest \
    --name robin_clear \
    --home /home/robin/clearlinux \
    --additional-flags "--shm-size=4g" \
    --additional-flags "--gpus all" \
    --additional-flags "--device=/dev/nvidiactl" \
    --additional-flags "--device=/dev/nvidia0"

Enter the environment:

distrobox enter robin_clear

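A quick way to confirm that the flags passed to distrobox took effect inside the container is a small check like the one below (my own helper, not part of distrobox; it needs python3, which is only installed in a later step):

# Hypothetical helper, not part of distrobox: confirm --device and --shm-size took effect.
import os

for dev in ("/dev/nvidiactl", "/dev/nvidia0"):
    print(dev, "present" if os.path.exists(dev) else "MISSING")

# /dev/shm should be roughly 4 GB because of --shm-size=4g
shm = os.statvfs("/dev/shm")
print("/dev/shm size: %.1f GB" % (shm.f_blocks * shm.f_frsize / 1e9))
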
Download the CUDA 11.3 run file and install it inside robin_clear:

sudo swupd bundle-add libxml2

sudo ./cuda_11.3.0_465.19.01_linux.run \
        --toolkit \
        --no-man-page \
        --override \
        --silent

Then the important part: install gcc-10 (ClearLinux comes with gcc-12, which is too new for CUDA 11.3) and create symbolic links for it:

sudo swupd bundle-add c-extras-gcc10
sudo ln -s /usr/bin/gcc-10 /usr/local/cuda/bin/gcc
sudo ln -s /usr/bin/g++-10 /usr/local/cuda/bin/g++

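To make sure the toolkit and the compiler line up before building anything, a small check like this helps (my own sanity script, assuming the run file installed into its default /usr/local/cuda prefix; python3 arrives in the next step):

# My own sanity check, not part of the CUDA installer.
import subprocess

def run(cmd):
    # run a command and return its full output as text
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# the symlink created above should report gcc 10.x
print(run(["/usr/local/cuda/bin/gcc", "--version"]).splitlines()[0])

# nvcc should report "release 11.3"
nvcc_out = run(["/usr/local/cuda/bin/nvcc", "--version"])
print([line for line in nvcc_out.splitlines() if "release" in line][0])
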
Install PyTorch:

sudo swupd bundle-add python3-basic

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

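A quick way to verify that the cu113 wheel actually sees the GPU (just a throwaway check, not from the PyTorch docs):

# verify the cu113 wheel can talk to the GPU passed into the container
import torch

print("torch:", torch.__version__)            # e.g. 1.11.0+cu113 at the time of writing
print("built for CUDA:", torch.version.cuda)  # expect "11.3"
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
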
Install apex for mixed-precision training (my model training uses it):

git clone https://github.com/NVIDIA/apex
cd apex
pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

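For context, the apex.amp pattern my training relies on looks roughly like the sketch below; the model, optimizer, and data here are dummies just to show the shape of the API, not my real training code:

# Minimal apex.amp mixed-precision sketch (dummy model and data).
import torch
from apex import amp

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# opt_level "O1" = mixed precision with automatic casting
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(32, 1024, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

loss = torch.nn.functional.cross_entropy(model(x), y)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()   # backward runs on the scaled loss
optimizer.step()
optimizer.zero_grad()
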
Now I can run my training on ClearLinux. Here is the comparison between Nvidia's official image and my ClearLinux images:

Image                                      | CUDA Version | PyTorch Version | Docker image size | VRAM usage | Time for training one batch
Nvidia official PyTorch image              | 11.7         | 1.12.0          | 14.7 GB           | 10620 MB   | 0.2745 seconds
My ClearLinux image                        | 11.3         | 1.11.0          | 12.8 GB           | 10936 MB   | 0.3066 seconds
My ClearLinux image (v2)                   | 11.3         | 1.12.0          | 12.8 GB           | 10964 MB   | 0.2812 seconds
My ClearLinux image (built PyTorch myself) | 11.7         | 1.13.0          | 12.8 GB           | 10658 MB   | 0.2716 seconds

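The per-batch numbers above come from my own model; the measurement itself can be as simple as the sketch below (again with a dummy model), where torch.cuda.synchronize() is needed because CUDA kernels are launched asynchronously:

# Sketch of timing one training batch (dummy model standing in for the real one).
import time
import torch

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(256, 1024, device="cuda")
y = torch.randint(0, 10, (256,), device="cuda")

# warm-up so one-time CUDA/cuDNN setup cost is not counted
for _ in range(5):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

torch.cuda.synchronize()
start = time.perf_counter()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
torch.cuda.synchronize()
print("one batch: %.4f seconds" % (time.perf_counter() - start))
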
It looks like Nvidia still does a better job than I do 🙂 The only way to beat its image is to use the newest CUDA and build state-of-the-art PyTorch manually.

Related Posts

  • Image pull policy in Kubernetes

    Recently, we use Kubernetes for our project. Yesterday, a problem haunted me severely: even I…

  • Debug CUDA error for PyTorch

    After I changed my dataset for my code, the training failed: /tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0],…

  • Run docker on centos6

    Docker use thin-provision of device mapper as its default storage, therefore if we wan't run…

