Solving PyTorch DDP: Finding the cause of "Expected to mark a variable ready only once"

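This morning, while running ablation experiments, I needed to reproduce results from two months ago. For no apparent reason, the same code in the same environment no longer ran, and loss.backward() raised the following error: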
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the ``forward`` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple ``checkpoint`` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases yet.

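Searching online turned up only a handful of people who had asked about this error. One of them, on Stack Overflow, solved it by setting find_unused_parameters to False, after which it inexplicably just worked. When I did the same, though, a different error appeared during the step that trains G with D frozen: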
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
But fixing "Expected to have finished reduction in the prior iteration before starting a new one" involves setting find_unused_parameters to True, which appears to directly contradict the fix above…
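For context, "training G with D frozen" usually looks something like the sketch below. It uses plain modules without DDP, and the toy `nn.Linear` layers, the loss, and the optimizer are all assumptions for illustration, since the real models are not shown here. The point is only that a frozen D receives no gradients during the generator step, which is the kind of situation the second error message describes ("parameters that were not used in producing loss"). Whether that is what actually triggered the error in this case is not confirmed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the generator (G) and discriminator (D);
# these are assumptions, not the models from the original experiment.
G = nn.Linear(4, 4)
D = nn.Linear(4, 1)
opt_G = torch.optim.SGD(G.parameters(), lr=0.1)

# Freeze D so the generator step produces no gradients for its parameters.
for p in D.parameters():
    p.requires_grad_(False)

z = torch.randn(2, 4)
fake = G(z)
loss_G = -D(fake).mean()  # toy generator loss, for illustration only

opt_G.zero_grad()
loss_G.backward()  # gradients flow through D's frozen weights into G
opt_G.step()

# Unfreeze D again before its own update step.
for p in D.parameters():
    p.requires_grad_(True)

# Collected so the effect of freezing can be inspected.
d_grads = [p.grad for p in D.parameters()]
g_grads = [p.grad for p in G.parameters()]
```

After this step every entry of `d_grads` is `None`, while every entry of `g_grads` holds a tensor.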

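In the end, out of options, I guessed that the problem might come from using distributed training on a single GPU, so I simply turned DDP off and trained again. Only then did the error go away. It still feels like black magic...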
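The workaround of turning DDP off for single-GPU runs could be written as a small helper. This is a minimal sketch, not the code from the original experiment: the function name `maybe_wrap_ddp` is hypothetical, and it assumes the process group, if there is one, has already been initialised elsewhere and the model has already been moved to its target device.

```python
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def maybe_wrap_ddp(model: nn.Module, find_unused_parameters: bool = False) -> nn.Module:
    """Hypothetical helper: wrap in DDP only when more than one process is running."""
    # No process group at all: plain single-process training, keep the bare module.
    if not (dist.is_available() and dist.is_initialized()):
        return model
    # A process group with a single rank gains nothing from gradient all-reduce.
    if dist.get_world_size() == 1:
        return model
    # Multi-process case: the model is assumed to be on its target device already.
    return DDP(model, find_unused_parameters=find_unused_parameters)


# Without an initialised process group, the module comes back unwrapped.
generator = maybe_wrap_ddp(nn.Linear(8, 8))
```

With this arrangement the same training script runs in both settings, and the `find_unused_parameters` choice only matters once more than one process is actually involved.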
