Using PyTorch on ClearLinux docker image
source link: https://donghao.org/2022/06/22/using-pytorch-on-clearlinux-docker-image/
I have been using Nvidia’s official PyTorch docker image for my model training for quite a long time. It works very well, but with one problem: the image is huge, more than 6GB. On my poor home network, it takes a painfully long time to download.
Yesterday, an interesting idea popped into my mind: why not build my own small docker image for PyTorch? So I started to do it.
First, I chose ClearLinux from Intel, since it is very clean and ships state-of-the-art software (which means its performance is fabulous).
I used distrobox to create my environment:
distrobox create --image clearlinux:latest \
--name robin_clear \
--home /home/robin/clearlinux \
--additional-flags "--shm-size=4g" \
--additional-flags "--gpus all" \
--additional-flags "--device=/dev/nvidiactl" \
--additional-flags "--device=/dev/nvidia0"
Enter the environment:
distrobox enter robin_clear
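Before installing anything, it is worth confirming that the GPU was actually passed through. A quick sanity check (guarded so it also reports cleanly when the driver utilities are not visible):

```shell
# Check that the Nvidia driver utilities are visible inside the container.
# If nvidia-smi is missing or fails, the --gpus/--device flags did not take effect.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version --format=csv
else
    echo "nvidia-smi not found: GPU not passed through?"
fi
```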
Download the CUDA 11.3 run file and install it in robin_clear:
sudo swupd bundle-add libxml2
sudo ./cuda_11.3.0_465.19.01_linux.run \
--toolkit \
--no-man-page \
--override \
--silent
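The run file installs the toolkit under `/usr/local/cuda` but does not touch the shell environment, so `nvcc` will not be on `PATH` until you add it. A minimal sketch, assuming the default install prefix:

```shell
# Put the CUDA toolkit on PATH and its libraries on the loader path.
# /usr/local/cuda is the default symlinked prefix created by the runfile installer.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```

Adding these two lines to `~/.bashrc` makes them survive re-entering the distrobox.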
Then, the important part: install gcc-10 (ClearLinux ships gcc-12, which is too new for CUDA 11.3) and create symbolic links for it:
sudo swupd bundle-add c-extras-gcc10
sudo ln -s /usr/bin/gcc-10 /usr/local/cuda/bin/gcc
sudo ln -s /usr/bin/g++-10 /usr/local/cuda/bin/g++
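An equivalent approach, if you would rather not place symlinks inside the CUDA tree, is to point nvcc at the host compiler explicitly with its `-ccbin` option (guarded here in case nvcc is not yet on `PATH`; `main.cu` is a hypothetical source file):

```shell
# Alternative to the symlinks: select the host compiler per invocation via -ccbin.
if command -v nvcc >/dev/null 2>&1; then
    nvcc -ccbin /usr/bin/gcc-10 main.cu -o main
else
    echo "nvcc not on PATH yet"
fi
```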
Install PyTorch:
sudo swupd bundle-add python3-basic
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
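After the install, a one-liner confirms that the cu113 wheel can actually see the GPU (guarded so it also reports cleanly when torch is missing):

```shell
# Verify the installed wheel and whether it can reach CUDA;
# falls back to a message if torch is not importable.
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())" \
    2>/dev/null || echo "torch not importable"
```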
Install apex for mixed-precision training (because my model training uses it):
git clone https://github.com/NVIDIA/apex
cd apex
pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
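Because `--cuda_ext` compiles apex's fused CUDA extensions at install time, it is worth checking the import right away (`amp_C` is one of the extension modules that build produces; the check falls back to a message if the build failed):

```shell
# Confirm apex and its compiled CUDA extensions are importable.
python3 -c "import apex, amp_C; print('apex OK')" \
    2>/dev/null || echo "apex extensions not importable"
```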
Now I can run my training in ClearLinux. Here is the comparison of the docker images:
| Image | CUDA Version | PyTorch Version | Docker image size | VRAM Usage | Time for training one batch |
| --- | --- | --- | --- | --- | --- |
| Nvidia Official PyTorch Image | 11.7 | 1.12.0 | 14.7 GB | 10620 MB | 0.2745 seconds |
| My ClearLinux Image | 11.3 | 1.11.0 | 12.8 GB | 10936 MB | 0.3066 seconds |
| My ClearLinux Image (v2) | 11.3 | 1.12.0 | 12.8 GB | 10964 MB | 0.2812 seconds |
| My ClearLinux Image (built PyTorch myself) | 11.7 | 1.13.0 | 12.8 GB | 10658 MB | 0.2716 seconds |
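The per-batch times in the table are easier to compare as percentages relative to the Nvidia image:

```shell
# Relative per-batch training time vs. the Nvidia image (0.2745 s baseline).
awk 'BEGIN {
    base = 0.2745
    printf "v1: %+.1f%%\n", (0.3066 - base) / base * 100
    printf "v2: %+.1f%%\n", (0.2812 - base) / base * 100
    printf "self-built: %+.1f%%\n", (0.2716 - base) / base * 100
}'
```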
Looks like Nvidia's image still works better than mine. The only chance to beat it is to use the newest CUDA and build state-of-the-art PyTorch manually.