Solving PyTorch DDP: Finding the cause of "Expected to mark a variable ready only once"


Polaris_T


Tags: pytorch, artificial intelligence, python

Article link: https://blog.csdn.net/qq_45717425/article/details/130088045


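While running ablation experiments this morning, I needed to reproduce results from two months ago. For no obvious reason, the same code in the same environment no longer ran, and loss.backward() raised the following error: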
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the ``forward`` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple ``checkpoint`` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.

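After searching online, I found that very few people have asked about this error. One person on Stack Overflow resolved it by setting find_unused_parameters to False, which fixed it without any clear explanation. When I made the same change, however, a different error appeared while training G with D frozen: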
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
The usual fix for "Expected to have finished reduction in the prior iteration before starting a new one", however, is to set find_unused_parameters to True, which appears to directly contradict the fix above...
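The second error message says that some parameters "were not used in producing loss". One way to see which ones it could be referring to is to run a single forward/backward pass on the plain (not DDP-wrapped) module and list the trainable parameters whose .grad is still None. The following is only a minimal sketch: ToyNet and params_without_grad are hypothetical names made up for illustration, not part of PyTorch and not the model from this post.

```python
import torch
import torch.nn as nn


def params_without_grad(module: nn.Module) -> list:
    """Return names of trainable parameters whose .grad is None after backward().

    Hypothetical helper for illustration; not a PyTorch API.
    """
    return [
        name
        for name, p in module.named_parameters()
        if p.requires_grad and p.grad is None
    ]


class ToyNet(nn.Module):
    """Hypothetical model with one branch that never contributes to the loss."""

    def __init__(self) -> None:
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)  # never called in forward()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.used(x)


model = ToyNet()  # plain module, deliberately not wrapped in DDP
loss = model(torch.randn(3, 4)).sum()
loss.backward()

# Parameters listed here received no gradient in this iteration.
missing = params_without_grad(model)
print(missing)
```

If the list is empty, every trainable parameter took part in computing the loss for that iteration. If it is not, the names show which submodules were skipped, which is the situation the find_unused_parameters flag is meant to handle.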