Trace memory error of CUDA program

 3 years ago
source link: http://www.donghao.org/2021/05/14/trace-memory-error-of-cuda-program/

A program that uses CUDA for GPU computation reported a memory error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] an illegal memory access was encountered LightGBM/src/treelearner/cuda_tree_learner.cpp 239

For an ordinary C++ program we would debug with gdb; for a CUDA program we should use cuda-gdb. Make sure the CUDA code is compiled with the -g flag (nvcc additionally accepts -G to generate debug information for device code), then run:

/usr/local/cuda-11.0/bin/cuda-gdb python3
(cuda-gdb) run test.py
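
The same session can also be started non-interactively. The sketch below only assembles the command line; it assumes that cuda-gdb accepts the standard gdb options `-ex` and `--args`, and the helper name `build_cuda_gdb_command` is hypothetical. The cuda-gdb path and `test.py` are the ones used above.

```python
import shlex


def build_cuda_gdb_command(script,
                           cuda_gdb="/usr/local/cuda-11.0/bin/cuda-gdb",
                           interpreter="python3"):
    """Build an argv list that starts `interpreter script` under cuda-gdb.

    Assumption: cuda-gdb understands gdb's `-ex <command>` and `--args`
    options, so `-ex run` replaces typing `run` at the (cuda-gdb) prompt.
    """
    return [cuda_gdb, "-ex", "run", "--args", interpreter, script]


if __name__ == "__main__":
    # Print the command instead of executing it, so this runs anywhere.
    print(shlex.join(build_cuda_gdb_command("test.py")))
```

Passing the returned list to `subprocess.run` would launch the debugger; it is left out here so the sketch does not depend on a CUDA installation.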

After a while, cuda-gdb reports the exact location in the code where the illegal memory access happened:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x1668b2f0 (histogram_16_64_256.cu:182)

Thread 1 "python3" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 10, block (2163,0,0), thread (0,0,0), device 0, sm 0, warp 3, lane 0]
0x000000001668b380 in LightGBM::histogram16<<<(7360,1,1),(16,1,1)>>> () at LightGBM/src/treelearner/kernels/histogram_16_64_256.cu:185
185            feature = (feature >> ((ind & 1) << 2)) & 0xf;
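
To relate the focused thread back to the data it was processing, it can help to turn the block and thread coordinates into a flat index. The following is a minimal sketch that parses the two lines of output above. It assumes the kernel indexes its data by the usual `blockIdx.x * blockDim.x + threadIdx.x` scheme, which the output alone does not confirm; the function name `global_thread_x` is hypothetical.

```python
import re

# "block (bx,by,bz), thread (tx,ty,tz)" from the "[Switching focus ...]" line.
FOCUS_RE = re.compile(
    r"block \((\d+),(\d+),(\d+)\), thread \((\d+),(\d+),(\d+)\)")
# "<<<(gx,gy,gz),(dx,dy,dz)>>>": grid and block dimensions of the launch.
LAUNCH_RE = re.compile(
    r"<<<\((\d+),(\d+),(\d+)\),\((\d+),(\d+),(\d+)\)>>>")


def global_thread_x(focus_line, frame_line):
    """Return blockIdx.x * blockDim.x + threadIdx.x for the focused thread."""
    focus = FOCUS_RE.search(focus_line)
    launch = LAUNCH_RE.search(frame_line)
    if focus is None or launch is None:
        raise ValueError("unrecognized cuda-gdb output")
    block_x, thread_x = int(focus.group(1)), int(focus.group(4))
    grid_x, block_dim_x = int(launch.group(1)), int(launch.group(4))
    # Sanity check: the coordinates must lie inside the launch configuration.
    if block_x >= grid_x or thread_x >= block_dim_x:
        raise ValueError("focus coordinates outside launch configuration")
    return block_x * block_dim_x + thread_x


focus = ("[Switching focus to CUDA kernel 0, grid 10, block (2163,0,0), "
         "thread (0,0,0), device 0, sm 0, warp 3, lane 0]")
frame = ("0x000000001668b380 in LightGBM::histogram16"
         "<<<(7360,1,1),(16,1,1)>>> ()")
print(global_thread_x(focus, frame))  # 2163 * 16 + 0 = 34608
```

Under that assumption, the faulting thread is number 34608 out of 7360 * 16 = 117760 launched threads, which narrows down which part of the input to inspect.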

12:57 am ROBIN DONG develope
CUDA