TL;DR
This paper identifies a critical issue in asynchronous RL training: when old logits are missing, off-policy correction breaks down. It proposes methods to recover these logits exactly or approximately, improving training efficiency and performance.
Contribution
It introduces three exact strategies and an approximate correction method to address missing old logits in asynchronous RL, enhancing off-policy correction accuracy.
Findings
Reveals how missing old logits affect off-policy correction.
Proposes three exact old-logit acquisition strategies.
Shows that a revised PPO-EWMA improves training speed and optimization performance.
Abstract
Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics…
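The two-factor decomposition described in the abstract can be illustrated with a minimal per-token sketch. The function and argument names below (`decompose_ratio`, `logp_infer_old`, `logp_train_old`, `logp_train_new`, `clip_staleness`) and the clipping range `eps=0.2` are assumptions made for illustration; they are not taken from the paper, and this is not the authors' implementation.

```python
import math

def decompose_ratio(logp_infer_old, logp_train_new, logp_train_old=None):
    """Split a per-token importance ratio into two factors.

    All names are hypothetical:
      logp_infer_old: log-prob of the token under the inference engine
                      at the behavior-policy version
      logp_train_old: log-prob under the training engine at that same
                      version (the "old logits"); None if they were lost
      logp_train_new: log-prob under the current training-side policy
    """
    # The total ratio is always computable from the rollout log-probs.
    total = math.exp(logp_train_new - logp_infer_old)
    if logp_train_old is None:
        # Missing old logits: the two factors cannot be separated.
        return None, None, total
    # Training-inference discrepancy: same version, different engines.
    discrepancy = math.exp(logp_train_old - logp_infer_old)
    # Policy staleness: historical training-side policy to current policy.
    staleness = math.exp(logp_train_new - logp_train_old)
    return discrepancy, staleness, total

def clip_staleness(staleness, eps=0.2):
    """PPO-style clip applied to the staleness factor only (assumed eps)."""
    return min(max(staleness, 1.0 - eps), 1.0 + eps)

d, s, t = decompose_ratio(-1.2, -0.7, logp_train_old=-1.0)
# The product of the two factors recovers the total ratio.
print(abs(d * s - t) < 1e-12)
# Without old logits only the entangled total ratio is available.
print(decompose_ratio(-1.2, -0.7))
```

When `logp_train_old` is available, a clip can be applied to the staleness factor alone, while the discrepancy factor is handled separately. When it is missing, only the entangled total ratio remains, which is the situation the abstract calls the missing-old-logit problem.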
