Using PyTorch on ClearLinux docker image

Source: https://donghao.org/2022/06/22/using-pytorch-on-clearlinux-docker-image/

I have been using Nvidia's official PyTorch docker image for my model training for quite a long time. It works very well, but the only problem is that the image is too large: more than 6 GB. On my slow home network, downloading it takes a painfully long time.

Yesterday, an interesting idea came to mind: why not build my own small docker image for PyTorch? So I started to do it.

First, I chose Intel's ClearLinux, since it is very clean and uses state-of-the-art software (which means its performance is fabulous).

I used distrobox to create my environment:

distrobox create --image clearlinux:latest \
    --name robin_clear \
    --home /home/robin/clearlinux \
    --additional-flags "--shm-size=4g" \
    --additional-flags "--gpus all" \
    --additional-flags "--device=/dev/nvidiactl" \
    --additional-flags "--device=/dev/nvidia0"

Enter the environment:

distrobox enter robin_clear

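A quick way to confirm that the flags passed to distrobox took effect inside the container is a small check like the one below (my own helper, not part of distrobox; it needs python3, which is only installed in a later step):

# Hypothetical helper, not part of distrobox: confirm --device and --shm-size took effect.
import os

for dev in ("/dev/nvidiactl", "/dev/nvidia0"):
    print(dev, "present" if os.path.exists(dev) else "MISSING")

# /dev/shm should be roughly 4 GB because of --shm-size=4g
shm = os.statvfs("/dev/shm")
print("/dev/shm size: %.1f GB" % (shm.f_blocks * shm.f_frsize / 1e9))
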
Download the CUDA 11.3 run file and install it inside robin_clear:

sudo swupd bundle-add libxml2

sudo ./cuda_11.3.0_465.19.01_linux.run \
        --toolkit \
        --no-man-page \
        --override \
        --silent

Then the important part: install gcc-10 (ClearLinux comes with gcc-12, which is too new for CUDA 11.3) and create symbolic links for it:

sudo swupd bundle-add c-extras-gcc10
sudo ln -s /usr/bin/gcc-10 /usr/local/cuda/bin/gcc
sudo ln -s /usr/bin/g++-10 /usr/local/cuda/bin/g++

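To make sure the toolkit and the compiler line up before building anything, a small check like this helps (my own sanity script, assuming the run file installed into its default /usr/local/cuda prefix; python3 arrives in the next step):

# My own sanity check, not part of the CUDA installer.
import subprocess

def run(cmd):
    # run a command and return its full output as text
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# the symlink created above should report gcc 10.x
print(run(["/usr/local/cuda/bin/gcc", "--version"]).splitlines()[0])

# nvcc should report "release 11.3"
nvcc_out = run(["/usr/local/cuda/bin/nvcc", "--version"])
print([line for line in nvcc_out.splitlines() if "release" in line][0])
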
Install PyTorch:

sudo swupd bundle-add python3-basic

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

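A quick way to verify that the cu113 wheel actually sees the GPU (just a throwaway check, not from the PyTorch docs):

# verify the cu113 wheel can talk to the GPU passed into the container
import torch

print("torch:", torch.__version__)            # e.g. 1.11.0+cu113 at the time of writing
print("built for CUDA:", torch.version.cuda)  # expect "11.3"
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
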
Install apex for mixed-precision training (my model training uses it):

git clone https://github.com/NVIDIA/apex
cd apex
pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

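For context, the apex.amp pattern my training relies on looks roughly like the sketch below; the model, optimizer, and data here are dummies just to show the shape of the API, not my real training code:

# Minimal apex.amp mixed-precision sketch (dummy model and data).
import torch
from apex import amp

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# opt_level "O1" = mixed precision with automatic casting
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(32, 1024, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

loss = torch.nn.functional.cross_entropy(model(x), y)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()   # backward runs on the scaled loss
optimizer.step()
optimizer.zero_grad()
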
Now I can run my training on ClearLinux. Here is the comparison between Nvidia's official image and my ClearLinux images:

Image                                      | CUDA Version | PyTorch Version | Docker image size | VRAM usage | Time for training one batch
Nvidia official PyTorch image              | 11.7         | 1.12.0          | 14.7 GB           | 10620 MB   | 0.2745 seconds
My ClearLinux image                        | 11.3         | 1.11.0          | 12.8 GB           | 10936 MB   | 0.3066 seconds
My ClearLinux image (v2)                   | 11.3         | 1.12.0          | 12.8 GB           | 10964 MB   | 0.2812 seconds
My ClearLinux image (built PyTorch myself) | 11.7         | 1.13.0          | 12.8 GB           | 10658 MB   | 0.2716 seconds

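The per-batch numbers above come from my own model; the measurement itself can be as simple as the sketch below (again with a dummy model), where torch.cuda.synchronize() is needed because CUDA kernels are launched asynchronously:

# Sketch of timing one training batch (dummy model standing in for the real one).
import time
import torch

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(256, 1024, device="cuda")
y = torch.randint(0, 10, (256,), device="cuda")

# warm-up so one-time CUDA/cuDNN setup cost is not counted
for _ in range(5):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

torch.cuda.synchronize()
start = time.perf_counter()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
torch.cuda.synchronize()
print("one batch: %.4f seconds" % (time.perf_counter() - start))
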
It looks like Nvidia still does a better job than I do 🙂 The only way to beat its image is to use the newest CUDA and build state-of-the-art PyTorch manually.

Related Posts

  • Image pull policy in Kubernetes

    Recently, we use Kubernetes for our project. Yesterday, a problem haunted me severely: even I…

  • Debug CUDA error for PyTorch

    After I changed my dataset for my code, the training failed: /tmp/pip-req-build-_tx3iysr/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:310: operator(): block: [0,0,0],…

  • Run docker on centos6

    Docker use thin-provision of device mapper as its default storage, therefore if we wan't run…

