似-然 2023-06-08 09:58

The same RecvTensor (GrpcWorker) request was received twice

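This error occurs when running a reinforcement learning algorithm with TensorFlow and Python.
The algorithm uses a distributed multi-process architecture with 1 'ps' and 2 'worker' tasks.
While one worker is training, the second sess.run call always fails with the following error: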

Process Process-6:
Traceback (most recent call last):
  File "/home/mxm/.local/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/mxm/.local/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/mxm/.local/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.AbortedError: From /job:train/replica:0/task:0:
The same RecvTensor (GrpcWorker) request was received twice. step_id: 105411384561817065 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;9d0efc4e4612caec;/job:train/replica:0/task:0/device:GPU:0;edge_206_pred_0/d1/bias/read;0:0" request_id: 7357696461822534118
Additional GRPC error information:
{"created":"@1686189090.458307545","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"The same RecvTensor (GrpcWorker) request was received twice. step_id: 105411384561817065 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;9d0efc4e4612caec;/job:train/replica:0/task:0/device:GPU:0;edge_206_pred_0/d1/bias/read;0:0" request_id: 7357696461822534118","grpc_status":10}
     [[{{node pred_0/d1/bias/read}}]]

1 answer

  • 憧憬blog 2023-06-13 17:35

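    This error is usually caused by a communication problem in TensorFlow distributed training. Specifically, it may come from an unstable network, a faulty compute device, or a bug in the code.

    You can try the following to resolve it:

    1. Make sure every compute device can reach the network and that the network is stable.
    2. Check the code for errors, especially the parts that handle communication in distributed training.
    3. Check that the TensorFlow version is compatible with the code.
    4. Adjust the distributed training parameters, such as batch size and learning rate, to reduce communication volume.
    5. Try a newer TensorFlow version, or use another distributed training framework in place of TensorFlow.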
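
For the first point, a rough way to check connectivity is to open a TCP connection to every task address in the cluster. The sketch below uses only the Python standard library. The `CLUSTER` dictionary and its addresses are hypothetical placeholders (the original post does not list them), laid out as one 'ps' task and two 'train' tasks to match the job names in the error message. A successful connection only shows that the port accepts TCP connections, not that the TensorFlow server behind it is healthy.

```python
import socket

# Hypothetical cluster layout: one 'ps' task and two 'train' tasks.
# The addresses are placeholders, not taken from the original post.
CLUSTER = {
    "ps": ["127.0.0.1:2222"],
    "train": ["127.0.0.1:2223", "127.0.0.1:2224"],
}


def is_reachable(address, timeout=2.0):
    """Return True if a TCP connection to 'host:port' succeeds within timeout."""
    host, port = address.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_cluster(cluster, timeout=2.0):
    """Map '/job:<name>/task:<index>' to the reachability of its address."""
    return {
        f"/job:{job}/task:{index}": is_reachable(address, timeout)
        for job, addresses in cluster.items()
        for index, address in enumerate(addresses)
    }


if __name__ == "__main__":
    for task, ok in check_cluster(CLUSTER).items():
        print(task, "reachable" if ok else "UNREACHABLE")
```

Running this from each machine (or each process's host) while the cluster is up shows whether any task address cannot be reached from that side.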

    In addition, the step ID and request ID in your error message can help you pinpoint where the problem occurs, so use them to guide the investigation.
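
As a small aid for that kind of investigation, the sketch below pulls the `step_id`, `request_id` and `rendezvous_key` out of the error text and splits the key into its parts. The helper name `parse_recv_tensor_error` is made up for illustration, and the five-part key layout (source device, source incarnation, destination device, edge name, frame and iteration) is an assumption inferred from the message in the question, not a documented format.

```python
import re

# Pattern for the three fields printed in the AbortedError message.
_FIELDS = re.compile(
    r'step_id: (?P<step_id>\d+) '
    r'rendezvous_key: "(?P<key>[^"]+)" '
    r'request_id: (?P<request_id>\d+)'
)


def parse_recv_tensor_error(message):
    """Extract step_id, request_id and the rendezvous key parts, or None."""
    match = _FIELDS.search(message)
    if match is None:
        return None
    # Assumed layout: src_device;src_incarnation;dst_device;edge_name;frame:iter
    parts = match.group("key").split(";")
    if len(parts) != 5:
        return None
    src, incarnation, dst, edge, frame_iter = parts
    return {
        "step_id": int(match.group("step_id")),
        "request_id": int(match.group("request_id")),
        "src_device": src,
        "src_incarnation": incarnation,
        "dst_device": dst,
        "edge_name": edge,
        "frame_iter": frame_iter,
    }


# The message text copied from the traceback in the question.
MESSAGE = (
    'The same RecvTensor (GrpcWorker) request was received twice. '
    'step_id: 105411384561817065 rendezvous_key: '
    '"/job:ps/replica:0/task:0/device:GPU:0;9d0efc4e4612caec;'
    '/job:train/replica:0/task:0/device:GPU:0;'
    'edge_206_pred_0/d1/bias/read;0:0" request_id: 7357696461822534118'
)

info = parse_recv_tensor_error(MESSAGE)
print(info["src_device"], "->", info["dst_device"], info["edge_name"])
```

For the message in the question, this shows that the failing transfer is the read of `pred_0/d1/bias` sent from GPU:0 of the 'ps' task to GPU:0 of 'train' task 0, which narrows the search to the code path that fetches that variable from the parameter server.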
