
Upgrade ubuntu to solve a GPU problem

source link: https://donghao.org/2023/03/17/upgrade-ubuntu-to-solve-a-gpu-problem/

After installing an RTX 2080 Ti in an old 2016 desktop at the beginning of 2019, we used it to train YOLOv6 for a while. But recently the training job would occasionally hang and the GPU would stop working. The only message I could see was from dmesg:

[ 8104.078794] NVRM: GPU at PCI:0000:01:00: GPU-b4f425ef-2d0f-f29e-5624-ff96b37c2c46
[ 8104.078796] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 8104.078797] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 8104.078803] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
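When a job dies silently like this, it helps to scan dmesg output for these NVRM Xid events automatically. A minimal sketch (the regex, function name, and sample line are my own, based on the log format shown above; the exact field layout can vary across driver versions):

```python
import re

# NVRM Xid lines as they appear in the dmesg output above
# (assumed format; may differ between NVIDIA driver versions).
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+),")

def find_xid_events(dmesg_text):
    """Return (pci_address, xid_code) tuples found in dmesg output."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(dmesg_text)]

sample = ("[ 8104.078796] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', "
          "name=<unknown>, GPU has fallen off the bus.")

# Xid 79 is the "GPU has fallen off the bus" event from the log above.
print(find_xid_events(sample))  # [('PCI:0000:01:00', 79)]
```

A cron job or watchdog could call this on `dmesg` output and restart or alert when an Xid event appears, instead of discovering a hung training job hours later.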

At first, I suspected the NVIDIA driver was too new. But after rolling back to an older driver, the same errors showed up in dmesg. The problem even seemed to occur more frequently; sometimes the GPU could not hold out for more than 24 hours.

Considering that Ubuntu 18.04 (and its Linux kernel) is quite old, I tried to upgrade it. Although I have installed many Linux systems and kernels on different machines (servers, desktops, laptops, and even development boards), this was the first time I upgraded an existing Ubuntu system in place.

By following the guide, I managed to upgrade from 18.04 to 20.04. Surprisingly, the new system works well with the older NVIDIA driver, and the GPU has now been running smoothly for more than 12 hours.
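For reference, the in-place upgrade boils down to roughly these commands (a sketch of Ubuntu's standard release-upgrade path using its own tooling; the guide I followed may include additional steps):

```shell
# Bring the current 18.04 system fully up to date first.
sudo apt update && sudo apt upgrade -y

# The release-upgrade tool ships in this package on Ubuntu.
sudo apt install -y update-manager-core

# Start the interactive release upgrade (18.04 -> 20.04).
sudo do-release-upgrade

# After rebooting into 20.04, verify the release and kernel versions.
lsb_release -a
uname -r
```

`do-release-upgrade` walks through the whole process interactively, including disabling third-party apt sources and prompting before replacing modified config files.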

In conclusion, we should use a new system (and a new kernel) with new hardware drivers. If the training job doesn't report any errors, I will keep using 20.04 and save the time of upgrading to 22.04.


March 17, 2023 - 0:52 RobinDong ops

