Strange PyTorch errors

 2 years ago
source link: https://divertingpan.github.io/post/wh-_qBYkG/

Watch the batch size when running on multiple GPUs

When using DataParallel, you may run into an error like this

IndexError: dimension specified as 0 but tensor has no dimensions

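When loading the dataset for training, set drop_last, or otherwise make sure every batch contains the same number of samples. In short, keep the size of every batch consistent.

ref: https://blog.csdn.net/weixin_44012382/article/details/108190006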
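
As a minimal sketch of the drop_last option, assuming a toy TensorDataset of 10 samples (the sizes are illustrative and not taken from the original post), the loader below discards the incomplete final batch so that every batch has exactly 4 samples:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (illustrative sizes): 10 samples, so batch_size=4 would
# normally leave a smaller final batch of 2.
dataset = TensorDataset(torch.randn(10, 3), torch.zeros(10))

# drop_last=True discards the incomplete final batch.
loader = DataLoader(dataset, batch_size=4, drop_last=True)

batch_sizes = [inputs.shape[0] for inputs, _ in loader]
print(batch_sizes)
```

Without drop_last=True the same loader would yield batches of 4, 4 and 2 samples.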

Code hangs

Sometimes the code hangs while CPU, GPU and disk all show no activity. After force-quitting, the traceback contains this line

gotit = waiter.acquire(True, timeout)

The cause is that cv2 is used inside the dataloader. You need to set

cv2.setNumThreads(0)
cv2.ocl.setUseOpenCL(False)

Also add the following at the very top of the script

import os
os.environ["OMP_NUM_THREADS"] = "1" 
os.environ["MKL_NUM_THREADS"] = "1" 

ref: https://nekokiku.cn/2021/07/05/dataloader%E7%9A%84%E6%AD%BB%E9%94%81%E9%97%AE%E9%A2%98/
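
One way to combine the settings above is sketched below. The worker_init_fn hook is an assumption of this sketch (the original only says that the cv2 calls are needed, not how they are wired in), and the helper name limit_cv2_threads is hypothetical. It applies the cv2 settings inside each DataLoader worker process.

```python
import os

# Set these at the very top, before numpy/torch/cv2 are imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"


def limit_cv2_threads(worker_id):
    """Hypothetical worker_init_fn: turn off cv2 threading and OpenCL."""
    import cv2  # imported inside the worker process

    cv2.setNumThreads(0)
    cv2.ocl.setUseOpenCL(False)
```

It would be passed as DataLoader(dataset, num_workers=4, worker_init_fn=limit_cv2_threads). Calling the two cv2 functions once at the top of the dataset module may already be enough.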

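Out-of-memory error even though GPU memory is not actually full

Scenario: on a multi-GPU server, GPU 0 was occupied, so only the later GPUs could be used. Then one day, running the code started raising the following error no matter how the batch size and data size were set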

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

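and somewhere earlier in the output there is also wording similar to "out of memory".

After some trial and error, the fix was to stop specifying the device like this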

device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')

and to use the following instead, which resolves the problem

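import os
# set this before CUDA is initialized
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
import torch
# the only visible GPU is now addressed as cuda:0
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')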

The exact mechanism is unclear.
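
The workaround relies on CUDA renumbering the visible devices: with CUDA_VISIBLE_DEVICES set to '1', the process sees a single GPU, and that GPU is addressed as cuda:0. The pure-Python sketch below only illustrates this index mapping. The helper name physical_gpu is hypothetical, and it assumes the variable holds a comma-separated list of integer indices.

```python
import os

# Must be set before CUDA is initialized (safest: before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"


def physical_gpu(logical_index):
    """Map a logical 'cuda:N' index to the index listed in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration; assumes a comma-separated
    list of integer indices.
    """
    visible = [int(i) for i in os.environ["CUDA_VISIBLE_DEVICES"].split(",")]
    return visible[logical_index]


print(physical_gpu(0))  # the GPU that 'cuda:0' refers to in this process
```

With this setting, asking for cuda:1 would fail, because only one device is visible to the process.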

RuntimeError: received 0 items of ancdata

https://github.com/pytorch/pytorch/issues/973
If you are having this problem, try running torch.multiprocessing.set_sharing_strategy('file_system') right after your import of torch
In my own tests, placing it anywhere near the top of the code was enough.

Multi-GPU parallel case: Missing key(s) in state_dict: "conv_1.weight", "bn1.weight", "bn1.bias",

First, use os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3" to select the GPUs you intend to use, then wrap the model

Net = net(args)
Net = torch.nn.DataParallel(Net)
Net = Net.to(device)

After that, use the model as usual. If you need to save the weights, save them like this

torch.save(Net.module.state_dict(), model_out_path)
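
A small helper can make the save call safe whether or not the model has been wrapped. This is a sketch, not the original author's code: the name unwrap is hypothetical, and a tiny nn.Linear stands in for the real network.

```python
from torch import nn


def unwrap(model):
    """Return the underlying module if `model` is wrapped in DataParallel."""
    return model.module if isinstance(model, nn.DataParallel) else model


# Stand-in model for illustration; the real network would go here.
wrapped = nn.DataParallel(nn.Linear(2, 2))

# Keys of the wrapper carry the 'module.' prefix.
wrapped_keys = list(wrapped.state_dict().keys())
# Keys of the underlying module do not.
clean_keys = list(unwrap(wrapped).state_dict().keys())
print(wrapped_keys, clean_keys)
```

Saving unwrap(Net).state_dict() instead of Net.state_dict() then produces a file that loads into a plain, unwrapped model.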

If you did not pay attention to this when saving the weights, loading them later fails with

RuntimeError: Error(s) in loading state_dict for ResNet:
Missing key(s) in state_dict: "conv_1.weight", "bn1.weight", "bn1.bias", ......
Unexpected key(s) in state_dict: "module.conv_1.weight", "module.bn1.weight", "module.bn1.bias", ......

The fix is

from collections import OrderedDict
import torch

# state_dict saved from a DataParallel-wrapped model (keys start with `module.`)
state_dict = torch.load(model_path)
# create a new OrderedDict whose keys do not contain the `module.` prefix
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    # strip the leading `module.` added by DataParallel
    name = k[len('module.'):] if k.startswith('module.') else k
    new_state_dict[name] = v
# load params
net.load_state_dict(new_state_dict)

ref: https://blog.csdn.net/weixin_41735859/article/details/108610687

Process has exited but GPU memory is not released

ref: https://blog.csdn.net/heiheiya/article/details/81454212
GPU memory is full, but no process is shown as using it.

fuser -v /dev/nvidia*

Then run kill -9 xxxxx on the processes listed for /dev/nvidia0 and the other devices, and the memory is freed.
