Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

Hu Wang; Congbo Ma; Ian Reid; Mohammad Yaqub

arXiv:2505.07527·cs.LG·April 23, 2026

TL;DR

This paper introduces KRPO, a Kalman filter-based enhancement to GRPO that improves advantage estimation in reinforcement learning for language models and leads to better reasoning performance.

Contribution

KRPO is a lightweight variant of GRPO that adds no learned parameters. It uses a 1D Kalman filter to adaptively estimate the reward baseline and its uncertainty, and it integrates into GRPO with minimal computational overhead.

Findings

01 KRPO improves training reward curves on reasoning benchmarks.

02 KRPO achieves higher final accuracy than standard GRPO.

03 Adaptive advantage estimation benefits language model reasoning.

Abstract

The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. For language modeling, Group Relative Policy Optimization (GRPO) was proposed to use the within-group sample mean as a baseline for advantage normalization. This estimator can be sensitive to small group size and rollout-level stochasticity, which may lead to suboptimal advantage estimates in some settings. In this paper, we propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), a lightweight variant that treats per-group rewards as noisy observations of a latent prompt-level reward baseline and uses a 1D Kalman filter to estimate both the baseline and its uncertainty. KRPO introduces no additional learned parameters and can be integrated into GRPO with minimal computational overhead. On mathematical reasoning benchmarks, KRPO consistently improves training…
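
The contrast the abstract draws, between a group-mean baseline and a Kalman-filtered baseline, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the noise parameters `q` and `r_noise`, the initial variance `p0`, and the choice to scale advantages by the predicted observation standard deviation are all hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import List, Optional, Tuple


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: subtract the group mean, divide by the group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def kalman_baseline(
    rewards: List[float],
    q: float = 1e-5,        # assumed process noise variance (hypothetical value)
    r_noise: float = 1e-2,  # assumed observation noise variance (hypothetical value)
    x0: Optional[float] = None,
    p0: float = 1.0,        # assumed initial estimate variance (hypothetical value)
) -> Tuple[float, float]:
    """Run a scalar Kalman filter over the rewards of one group.

    Each reward is treated as a noisy observation of a latent baseline.
    Returns the final baseline estimate and its variance.
    """
    x = rewards[0] if x0 is None else x0
    p = p0
    for z in rewards:
        p = p + q                # predict: baseline assumed constant, variance grows by q
        k = p / (p + r_noise)    # Kalman gain
        x = x + k * (z - x)      # update the estimate with the innovation
        p = (1.0 - k) * p        # update the variance
    return x, p


def kalman_advantages(
    rewards: List[float], q: float = 1e-5, r_noise: float = 1e-2
) -> List[float]:
    """Baseline-subtracted rewards, scaled by the predicted observation std.

    The scaling by sqrt(p + r_noise) is an illustrative assumption only.
    """
    baseline, p = kalman_baseline(rewards, q=q, r_noise=r_noise)
    scale = math.sqrt(p + r_noise)
    return [(r - baseline) / scale for r in rewards]


if __name__ == "__main__":
    group = [1.0, 0.0, 1.0, 0.0]
    print(grpo_advantages(group))
    print(kalman_advantages(group))
```

Because each filter update is a convex combination of the previous estimate and the new reward, the baseline in this sketch always stays within the range of the observed rewards, and the returned variance gives a rough indication of how much to trust it for a small group.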

Code & Models

Repositories

billhhh/KRPO_LLMs_RL (GitHub)
