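Error when running a reinforcement learning algorithm with TensorFlow + Python
The algorithm uses a distributed multi-process architecture with 1 'ps' and 2 'worker' processes.
While one worker is training, the second call to sess.run always fails with the following error: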
Process Process-6:
Traceback (most recent call last):
  File "/home/mxm/.local/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/mxm/.local/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/mxm/.local/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.AbortedError: From /job:train/replica:0/task:0:
The same RecvTensor (GrpcWorker) request was received twice. step_id: 105411384561817065 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;9d0efc4e4612caec;/job:train/replica:0/task:0/device:GPU:0;edge_206_pred_0/d1/bias/read;0:0" request_id: 7357696461822534118
Additional GRPC error information:
{"created":"@1686189090.458307545","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"The same RecvTensor (GrpcWorker) request was received twice. step_id: 105411384561817065 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;9d0efc4e4612caec;/job:train/replica:0/task:0/device:GPU:0;edge_206_pred_0/d1/bias/read;0:0" request_id: 7357696461822534118","grpc_status":10}
[[{{node pred_0/d1/bias/read}}]]
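
To make the message easier to read, here is a small sketch that splits the rendezvous_key from the log into its parts. It assumes the key is five ';'-separated fields (source device, source incarnation, destination device, edge name, frame:iteration), which is my understanding of how TensorFlow builds these keys and not something the log itself confirms. RendezvousKey and parse_rendezvous_key are hypothetical helper names, not TensorFlow APIs.

```python
from typing import NamedTuple


class RendezvousKey(NamedTuple):
    """Hypothetical container for the fields of a rendezvous key."""
    src_device: str       # device that produces the tensor
    src_incarnation: str  # hex id of the sending device's incarnation
    dst_device: str       # device that requests the tensor
    edge_name: str        # name of the graph edge being transferred
    frame_iter: str       # "frame_id:iter_id" of the transfer


def parse_rendezvous_key(key: str) -> RendezvousKey:
    """Split a key on ';' (assumes exactly five fields, none containing ';')."""
    parts = key.split(";")
    if len(parts) != 5:
        raise ValueError(f"expected 5 ';'-separated fields, got {len(parts)}")
    return RendezvousKey(*parts)


# Key copied from the error message above.
key = (
    "/job:ps/replica:0/task:0/device:GPU:0;9d0efc4e4612caec;"
    "/job:train/replica:0/task:0/device:GPU:0;"
    "edge_206_pred_0/d1/bias/read;0:0"
)
parsed = parse_rendezvous_key(key)
for field, value in parsed._asdict().items():
    print(f"{field}: {value}")
```

Read this way, the failing transfer is the pred_0/d1/bias/read tensor going from GPU:0 on the ps task to GPU:0 on task 0 of the job named 'train' (the worker job in this cluster), and the ps reports that it received the same request for it twice.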