FG-ExPO: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum
Mingxiong Lin, Zhangquan Gong, Maowen Tang, Qian Li, Chuangchuang Wang, Jian Ma, Sutian Huang, Kai Tang, Haonan Lu

TL;DR
FG-ExPO improves reinforcement learning for mathematical reasoning by adapting the KL penalty to batch accuracy and concentrating question sampling on moderately difficult problems, outperforming vanilla GRPO across six benchmarks.
Contribution
The paper introduces two lightweight components, AKL and GCS, that dynamically balance exploration and focus training on the learning frontier, outperforming standard GRPO.
Findings
Achieves a 13.34% absolute improvement in pass@32 on AIME 2025.
Outperforms vanilla GRPO across six benchmarks.
Enlarges the effective exploration space of models.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, with Group Relative Policy Optimization (GRPO) serving as the dominant algorithm. We identify two overlooked inefficiencies inherent in GRPO. First, a fixed KL coefficient overly restricts policy exploration at moments when the model needs to diverge significantly from the reference policy. Second, uniform question sampling overlooks that moderately difficult problems produce the most informative gradient signals. We propose FG-ExPO, short for Frontier-Guided Exploration-Prioritized Policy Optimization, which integrates two lightweight components. Accuracy-Conditioned KL Scaling (AKL) adjusts the KL penalty strength through a smooth nonlinear function of batch average accuracy, loosening the constraint when the model performs poorly and strengthening it when the model…
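The two ideas described in the abstract can be illustrated with a minimal sketch. The exact functional forms are not given in the text above, so everything here is an assumption made for illustration: the sigmoid schedule, the Gaussian centered at 0.5 accuracy, the function names `akl_coefficient`, `gaussian_sampling_weights`, and `sample_questions`, and all default parameter values are hypothetical and are not taken from the paper.

```python
import math
import random


def akl_coefficient(batch_accuracy, beta_min=0.001, beta_max=0.04,
                    steepness=10.0, midpoint=0.5):
    """Hypothetical accuracy-conditioned KL coefficient.

    A sigmoid of batch average accuracy: the KL penalty is weak when the
    model performs poorly (more room to explore) and strong when it
    performs well. The sigmoid form and the defaults are assumptions.
    """
    gate = 1.0 / (1.0 + math.exp(-steepness * (batch_accuracy - midpoint)))
    return beta_min + (beta_max - beta_min) * gate


def gaussian_sampling_weights(question_accuracies, center=0.5, sigma=0.2):
    """Hypothetical Gaussian curriculum weights.

    Questions whose empirical accuracy is close to `center` (moderately
    difficult) receive the largest sampling probability. The center and
    width are assumptions, not values from the paper.
    """
    raw = [math.exp(-((acc - center) ** 2) / (2.0 * sigma ** 2))
           for acc in question_accuracies]
    total = sum(raw)
    return [w / total for w in raw]


def sample_questions(question_ids, question_accuracies, k, rng=None):
    """Draw k questions (with replacement) according to the Gaussian weights."""
    rng = rng or random.Random(0)
    weights = gaussian_sampling_weights(question_accuracies)
    return rng.choices(question_ids, weights=weights, k=k)


if __name__ == "__main__":
    # Lower batch accuracy gives a smaller KL coefficient than higher accuracy.
    print(akl_coefficient(0.1), akl_coefficient(0.9))
    # Questions with accuracy near 0.5 dominate the sampled batch.
    ids = ["q_easy", "q_mid", "q_hard"]
    accs = [0.95, 0.50, 0.05]
    print(gaussian_sampling_weights(accs))
    print(sample_questions(ids, accs, k=5))
```

In a GRPO-style training loop, a coefficient like this would multiply the KL term of the loss at each step, and the weights would replace uniform sampling over the question pool. Both choices are sketches of the general idea rather than the authors' implementation.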
