cuda2 but found one of them on device cuda 0

Article link: https://blog.csdn.net/m0_37052320/article/details/120448343


The key point about this error is that it is not the usual CPU/GPU mismatch: the data ends up on different GPUs.
Where it goes wrong:

model = nn.DataParallel(model, device_ids=[2,4])

Solutions first.
Option 1:

model = nn.DataParallel(model, device_ids=[0, 2, 4])

Index 0 has to come first in the list. But what if GPU 0 is already taken by someone else?
Option 2:

import os
# Set these before CUDA is initialized (before the first torch.cuda call).
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2,4"
model = nn.DataParallel(model, device_ids=[0, 1])

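The first two lines restrict which GPUs are visible to the process. When the model is then wrapped in DataParallel, use 0 and 1 (the logical indices) in place of the original 2 and 4 (the physical indices).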
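The physical-to-logical renumbering can be sketched in a few lines. This is only an illustration: the helper name logical_ids_for is hypothetical, and it assumes it runs before anything initializes CUDA in the process, since the environment variables are read at that point.

```python
import os


def logical_ids_for(physical_ids):
    """Expose only the given physical GPUs and return their logical ids.

    Hypothetical helper for illustration; call it before any CUDA work.
    """
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    # Inside the process the visible GPUs are renumbered starting from 0,
    # so physical [2, 4] becomes logical [0, 1].
    return list(range(len(physical_ids)))


device_ids = logical_ids_for([2, 4])
print(device_ids)  # [0, 1]
```

The returned list is what would be passed as device_ids to nn.DataParallel.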

The cause lies in DataParallel's default arguments:

CLASS torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)

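Note that output_device is None here. According to the PyTorch documentation, when it is left unset it falls back to device_ids[0], not to GPU 0. GPU 0 gets involved because DataParallel requires the module's parameters and buffers to already be on device_ids[0], and a model moved with a bare .cuda() or .to("cuda") typically lands on cuda:0. With device_ids=[2,4] the wrapper expects the model on cuda:2 but finds it on cuda:0, which produces an error about a GPU that was never requested: "cuda2 but found one of them on device cuda 0". Listing 0 first works because device_ids[0] then matches where the model already is; restricting CUDA_VISIBLE_DEVICES works because physical GPU 2 becomes logical cuda:0.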
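A third way to avoid the mismatch, without involving GPU 0 at all, is to place the model on the first listed device before wrapping it. The following is a minimal sketch, not the original author's method: the helper name wrap_parallel is hypothetical, and the fallback that returns the plain model when too few GPUs are visible is an assumption added so the snippet also runs on smaller machines.

```python
import torch
import torch.nn as nn


def wrap_parallel(model, device_ids):
    """Hypothetical helper: move the model to device_ids[0], then wrap it."""
    if torch.cuda.device_count() <= max(device_ids):
        # Assumption for this sketch: not enough GPUs visible, so the
        # model is returned unwrapped instead of raising an error.
        return model
    # DataParallel expects parameters and buffers on device_ids[0].
    model = model.to(f"cuda:{device_ids[0]}")
    return nn.DataParallel(model, device_ids=device_ids)


model = wrap_parallel(nn.Linear(8, 2), device_ids=[2, 4])
```

On a machine with at least five GPUs this puts the parameters on cuda:2 and replicates them to cuda:4 during the forward pass; cuda:0 is never touched by the model.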