TL;DR
This paper introduces KRPO, a Kalman-filter-based variant of GRPO that improves advantage estimation in reinforcement learning for language models and yields better performance on mathematical reasoning benchmarks.
Contribution
KRPO is a lightweight variant of GRPO that adds no learned parameters. It uses a 1D Kalman filter to adaptively estimate the latent reward baseline and its uncertainty, and integrates into GRPO with minimal computational overhead.
Findings
KRPO improves training reward curves on reasoning benchmarks.
KRPO achieves higher final accuracy than standard GRPO.
Adaptive advantage estimation benefits language model reasoning.
Abstract
The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. For language modeling, Group Relative Policy Optimization (GRPO) was proposed to use the within-group sample mean as a baseline for advantage normalization. This estimator can be sensitive to small group size and rollout-level stochasticity, which may lead to suboptimal advantage estimates in some settings. In this paper, we propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), a lightweight variant that treats per-group rewards as noisy observations of a latent prompt-level reward baseline and uses a 1D Kalman filter to estimate both the baseline and its uncertainty. KRPO introduces no additional learned parameters and can be integrated into GRPO with minimal computational overhead. On mathematical reasoning benchmarks, KRPO consistently improves training…
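The contrast the abstract draws between the group-mean baseline and a filtered baseline can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the noise variances `process_var` and `obs_var`, the initial state, and the choice to normalize each reward by the filtered mean and variance are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Group-mean baseline: subtract the group mean, divide by the group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def kalman_advantages(rewards, process_var=1e-2, obs_var=1.0,
                      init_mean=0.0, init_var=1.0, eps=1e-8):
    """Hypothetical 1D Kalman-filter baseline over one group's rewards.

    Each reward is treated as a noisy observation of a latent scalar
    baseline. All hyperparameter defaults are illustrative assumptions.
    """
    r = np.asarray(rewards, dtype=float)
    mean, var = init_mean, init_var
    means = np.empty_like(r)
    variances = np.empty_like(r)
    for i, obs in enumerate(r):
        # Predict: the latent baseline follows a random walk, so only
        # the uncertainty grows.
        var = var + process_var
        # Update: blend prediction and observation via the Kalman gain.
        gain = var / (var + obs_var)
        mean = mean + gain * (obs - mean)
        var = (1.0 - gain) * var
        means[i] = mean
        variances[i] = var
    # Assumed normalization: center on the filtered mean and scale by the
    # filtered standard deviation.
    advantages = (r - means) / np.sqrt(variances + eps)
    return advantages, means, variances


if __name__ == "__main__":
    group_rewards = [1.0, 0.0, 1.0, 1.0]  # e.g. binary correctness rewards
    print("group-mean advantages:", grpo_advantages(group_rewards))
    adv, means, variances = kalman_advantages(group_rewards)
    print("filtered baseline:    ", means)
    print("filtered variance:    ", variances)
    print("Kalman advantages:    ", adv)
```

With these assumed settings the filtered variance shrinks as more rewards in the group are observed, so later estimates of the baseline carry less uncertainty than early ones. The filter keeps only two scalars of state per group, which is consistent with the abstract's claim of minimal overhead and no learned parameters.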
