Online Causal Kalman Filtering for Stable and Effective Policy Optimization
Shuo He, Lang Feng, Xin Cheng, Lei Feng, Bo An

TL;DR
This paper introduces an online Kalman filtering approach to stabilize token-level importance sampling ratios in reinforcement learning for language models, improving policy optimization stability and effectiveness.
Contribution
It proposes Online Causal Kalman Filtering for Policy Optimization (KPO), which models the importance sampling ratio as a latent state updated online across tokens, addressing the token-level structural inconsistency of off-policy deviation that previous approaches neglect.
Findings
KPO outperforms state-of-the-art methods on math reasoning datasets.
The Kalman filter effectively smooths noise in importance ratios.
Token-wise structure-aware filtering improves training stability.
Abstract
Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and…
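The abstract is truncated here, so the paper's exact state-space model is not visible. As a rough illustration of the general idea only (a latent IS ratio updated causally, token by token), the following is a minimal sketch under stated assumptions: a scalar random-walk Kalman filter applied to per-token log IS ratios. The function names, the choice to filter in log space, and the noise variances `q` and `r` are hypothetical and not taken from the paper.

```python
import math


def causal_kalman_smooth(log_ratios, q=1e-2, r=1e-1):
    """Scalar random-walk Kalman filter over per-token log IS ratios.

    Assumptions (illustrative, not from the paper):
      state model:       x_t = x_{t-1} + w_t,  w_t ~ N(0, q)
      observation model: z_t = x_t + v_t,      v_t ~ N(0, r)
    Each output depends only on tokens up to and including t (causal).
    """
    if not log_ratios:
        return []
    x = log_ratios[0]  # initial state estimate: the first observation
    p = r              # initial estimate variance
    filtered = [x]
    for z in log_ratios[1:]:
        p = p + q            # predict: uncertainty grows under the random walk
        k = p / (p + r)      # Kalman gain in (0, 1)
        x = x + k * (z - x)  # update: move the estimate toward the observation
        p = (1.0 - k) * p    # posterior variance
        filtered.append(x)
    return filtered


def filtered_is_ratios(logp_new, logp_old, q=1e-2, r=1e-1):
    """Turn per-token log-probabilities into filtered IS ratios."""
    log_ratios = [a - b for a, b in zip(logp_new, logp_old)]
    return [math.exp(v) for v in causal_kalman_smooth(log_ratios, q, r)]
```

In this sketch a larger `r` relative to `q` makes the filtered ratio change more slowly across adjacent tokens, while a larger `q` lets it track the raw per-token ratio more closely.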
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning
