Align and Filter: Improving Performance in Asynchronous On-Policy RL
Homayoun Honari, Roger Creus Castanyer, Michael Przystupa, Michael Noukhovitch, Pablo Samuel Castro, Glen Berseth

TL;DR
This paper introduces a new method called VACO to mitigate policy lag in distributed on-policy reinforcement learning, improving robustness and performance in both classic and modern tasks.
Contribution
The paper identifies sources of policy lag in distributed RL and proposes VACO, a novel approach that aligns advantages to reduce lag and enhance learning stability.
Findings
VACO improves robustness to policy lag in classic RL tasks.
VACO enhances performance in RL tasks involving language models.
Empirical results show better stability and efficiency with VACO.
Abstract
Distributed training and increasing the gradient update frequency are practical strategies to accelerate learning and improve performance, but both exacerbate a central challenge: policy lag, which is the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use the findings to propose total Variation-based Advantage aligned Constrained policy Optimization (VACO) as a practical approach to mitigate policy lag. We empirically validate our method and show that it offers better robustness to policy lag in classic RL tasks and a modern RL for LLM math reasoning task.
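To make the notion of policy lag concrete, here is a minimal sketch of how the mismatch between a behavior policy and a learning policy can be measured from samples. It relies on the standard identity TV(pi, mu) = 0.5 * E_{a ~ mu}[|pi(a)/mu(a) - 1|] and is not taken from the paper; the 1-D Gaussian policies and the names `gaussian_log_prob`, `tv_estimate`, and `lag_by_shift` are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_log_prob(a, mean, std):
    """Log-density of a 1-D Gaussian policy evaluated at actions `a`."""
    return -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)


def tv_estimate(logp_learner, logp_behavior):
    """Monte Carlo estimate of total variation from behavior-policy samples.

    Uses TV(pi, mu) = 0.5 * E_{a ~ mu}[|pi(a) / mu(a) - 1|], where the
    density ratio is computed from log-probabilities.
    """
    ratio = np.exp(logp_learner - logp_behavior)
    return 0.5 * np.mean(np.abs(ratio - 1.0))


rng = np.random.default_rng(0)
behavior_mean, std = 0.0, 1.0  # hypothetical stale behavior policy

# Actions are collected by the (stale) behavior policy.
actions = rng.normal(behavior_mean, std, size=100_000)
logp_behavior = gaussian_log_prob(actions, behavior_mean, std)

# The learner drifts away from the behavior policy as updates accumulate;
# here the drift is modeled as a shift of the policy mean (an assumption).
lag_by_shift = {}
for shift in (0.0, 0.1, 0.5):
    logp_learner = gaussian_log_prob(actions, behavior_mean + shift, std)
    lag_by_shift[shift] = tv_estimate(logp_learner, logp_behavior)

for shift, tv in lag_by_shift.items():
    print(f"mean shift {shift:.1f}: estimated TV {tv:.4f}")
```

For two unit-variance Gaussians whose means differ by delta, the exact total variation is 2 * Phi(delta / 2) - 1, so the estimates can be checked against a closed form. A quantity of this kind could be compared against a threshold to decide which samples are too stale to use, though how the paper does this is not specified here.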
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper offers solid, well-structured theoretical analysis that clearly supports the method. 2. The proposed approach targets a relevant problem in distributed/on-policy RL. 3. The experiments stay focused on the core question, with detailed comparisons provided in the appendix.
Limited sensitivity analysis of the TV threshold. The method fixes the threshold and does not ablate it. It’s unclear how training stability and performance depend on the threshold.
1. The paper theoretically characterizes lag. While it seems obvious to me that lag would interfere with policy optimization, it's very useful to show this mathematically. 2. Experiments on both RLVR for LLMs and more standard RL tasks (MuJoCo) are a nice plus. The "RL for LLMs" feels too distinct from the broader RL community, so I'm glad that people are writing papers that evaluate on both setups.
I lean to reject due to limited experiments and missing discussion on existing works for managing off-policy-ness during on-policy learning. 1. **On-policy learning with Off-policy data**. My understanding is that the policy lag problem is simply the problem of trying to perform on-policy updates with off-policy data. Several works have studied how to leverage off-policy data for on-policy updates, especially importance sampling methods like Queeney et al 2021, but the only baselines considered
1. Policy lag is a real bottleneck in scaling on-policy RL and LLM fine-tuning. The paper’s framing of forward vs backward lag is conceptually clear and useful. 2. Demonstrating both robotics and LLM reasoning experiments is good.
1. The filtering rule (Eq. 16) is heuristic; there is no formal proof that it guarantees constraint satisfaction or unbiased gradient estimates. 2. The use of TV divergence in continuous action spaces is only approximated by sample-based density ratios (Eq. 5). 3. Four seeds are statistically insufficient for bootstrap confidence intervals; the CIs may not be meaningful. 4. Here is my major concern. It’s quite surprising that VACO needs 100 million steps to show improvements on MuJoCo, these are re
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
