Online Causal Kalman Filtering for Stable and Effective Policy Optimization

Shuo He; Lang Feng; Xin Cheng; Lei Feng; Bo An

arXiv:2602.10609 · cs.CL · March 3, 2026

TL;DR

This paper introduces an online Kalman filtering approach to stabilize token-level importance sampling ratios in reinforcement learning for language models, improving policy optimization stability and effectiveness.

Contribution

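It proposes Online Causal Kalman Filtering for Policy Optimization (KPO), which models the desired importance sampling ratio as a latent state that evolves across tokens and updates it online with a Kalman filter. This addresses the token-level structural inconsistency of local off-policy deviation that sequence-level and per-token adjustment methods neglect.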

Findings

01 KPO outperforms state-of-the-art methods on math reasoning datasets.

02 The Kalman filter effectively smooths noise in importance ratios.

03 Token-wise structure-aware filtering improves training stability.

Abstract

Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which can destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and…
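
The idea of treating the IS ratio as a latent state that is updated token by token can be illustrated with a minimal scalar Kalman filter. This is only a sketch under stated assumptions, not the paper's implementation: the random-walk state model, filtering in log-ratio space, the function names `causal_kalman_filter` and `filtered_is_ratios`, and the noise variances `process_var` and `obs_var` are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
import math


def causal_kalman_filter(log_ratios, process_var=1e-3, obs_var=1e-1,
                         init_mean=0.0, init_var=1.0):
    """Smooth a sequence of noisy per-token log IS ratios.

    Assumed model (illustrative, not taken from the paper):
      state:       x_t = x_{t-1} + w_t,  w_t ~ N(0, process_var)
      observation: z_t = x_t + v_t,      v_t ~ N(0, obs_var)
    The filter is causal: the estimate at token t uses tokens 1..t only.
    """
    mean, var = init_mean, init_var
    filtered = []
    for z in log_ratios:
        # Predict step: a random walk keeps the mean and inflates the variance.
        var_pred = var + process_var
        # Update step: blend the prediction with the new observation.
        gain = var_pred / (var_pred + obs_var)
        mean = mean + gain * (z - mean)
        var = (1.0 - gain) * var_pred
        filtered.append(mean)
    return filtered


def filtered_is_ratios(logp_new, logp_old, **filter_kwargs):
    """Turn per-token log-probabilities into smoothed IS ratios."""
    # Raw log ratio per token: log pi_new(token) - log pi_old(token).
    log_ratios = [new - old for new, old in zip(logp_new, logp_old)]
    smoothed = causal_kalman_filter(log_ratios, **filter_kwargs)
    return [math.exp(x) for x in smoothed]
```

For example, `filtered_is_ratios(logp_new, logp_old)` returns one smoothed ratio per token. Because each estimate depends only on the current and earlier tokens, changing a later token's log-probability leaves all earlier filtered ratios unchanged, which is what "causal" means in this sketch. A larger `obs_var` relative to `process_var` smooths more aggressively.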

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning