Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Zhong Guan; Yongjian Guo; Haoran Sun; Wen Huang; Shuai Di; Likang Wu; Xiong Jun Wu; Hongke Zhao

arXiv:2605.12070 · cs.LG · May 19, 2026

TL;DR

This paper identifies a failure mode in asynchronous RL training in which missing old logits (the historical training-side logits) break PPO-style off-policy correction, and proposes methods to recover these logits exactly or approximately, improving training efficiency and performance.

Contribution

It introduces three exact strategies and an approximate correction method to address missing old logits in asynchronous RL, enhancing off-policy correction accuracy.

Findings

01 Revealed the impact of missing old logits on off-policy correction.

02 Proposed three exact old-logit acquisition strategies.

03 Revised PPO-EWMA improves training speed and optimization performance.

Abstract

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics…
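The two-factor decomposition described in the abstract can be illustrated with a minimal per-token sketch. This is not the paper's implementation: the function name, argument names, and numeric values are all assumptions made for illustration. It only shows the algebraic identity that the total importance ratio equals the discrepancy factor times the staleness factor, and what happens to that split when the old training-side log-probability is unavailable.

```python
import math


def split_importance_ratio(logp_current, logp_old_train, logp_old_infer):
    """Return (discrepancy, staleness, total) importance ratios for one token.

    All argument names are illustrative assumptions:
      logp_current:   log-prob under the current training policy
      logp_old_train: log-prob from the training side at the behavior-policy
                      version (derived from the "old logits")
      logp_old_infer: log-prob from the inference side that sampled the token
    """
    # Training-inference discrepancy: same policy version, different engines.
    discrepancy = math.exp(logp_old_train - logp_old_infer)
    # Policy staleness: historical policy versus current policy.
    staleness = math.exp(logp_current - logp_old_train)
    # Total ratio between the current policy and the sampling distribution.
    total = math.exp(logp_current - logp_old_infer)
    return discrepancy, staleness, total


# Hypothetical per-token log-probabilities, chosen only for illustration.
logp_current, logp_old_train, logp_old_infer = -1.0, -1.2, -1.5

discrepancy, staleness, total = split_importance_ratio(
    logp_current, logp_old_train, logp_old_infer
)
# The two factors multiply back to the total ratio.
assert math.isclose(discrepancy * staleness, total)

# If the old training-side log-prob is missing and the current one is
# substituted for it (one possible fallback, assumed here for illustration),
# the staleness factor collapses to 1 and the first factor absorbs both effects.
mixed, collapsed, _ = split_importance_ratio(
    logp_current, logp_current, logp_old_infer
)
assert collapsed == 1.0
assert math.isclose(mixed, total)
```

In the second call the two factors can no longer be clipped or constrained separately, which is one way to read the entanglement the abstract describes.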

Code & Models

Repositories

millioniron/ROLL (GitHub)
