6

Tensorflow出现CUDNN_STATUS_INTERNAL_ERROR的解决方法

 2 years ago
source link: https://allenwind.github.io/blog/12321/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client
Mr.Feng Blog

NLP、深度学习、机器学习、Python、Go

Tensorflow出现CUDNN_STATUS_INTERNAL_ERROR的解决方法

Tensorflow出现CUDNN_STATUS_INTERNAL_ERROR如何解决?

发现在最近安装的Ubuntu20.4上使用TensorFlow偶尔会出现如下错误,

I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6253 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
Epoch 1/12
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:328] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

这个CUDNN_STATUS_INTERNAL_ERROR导致模型训练无法使用GPU,只能在CPU上跑。我看Github上有解决方法是重启主机,但是这个万能重启打法并不友好。

这里有一种解决方案。有些情况下是因为硬件和系统的问题,导致TensorFlow无法连接到GPU,可以通过重启NVIDIA内核解决,

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

我们也可以把这段小代码写到.bashrc上,

restart_gpu() {
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
}

这样遇到CUDNN_STATUS_INTERNAL_ERROR问题时直接在命令行上执行restart_gpu即可。

如果以上尝试桌面版系统如果还是无办法解决,重启主机即可。

转载请包括本文地址:https://allenwind.github.io/blog/12321
更多文章请参考:https://allenwind.github.io/blog/archives/


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK