EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling

Mingxiong Lin; Zhangquan Gong; Maowen Tang; Qian Li; Chuangchuang Wang; Jian Ma; Sutian Huang; Kai Tang; Haonan Lu

arXiv:2605.09923 · cs.AI · May 14, 2026

TL;DR

This paper introduces EXPO, a reinforcement learning method that adds adaptive KL regulation and Gaussian curriculum sampling to GRPO to improve policy exploration and training efficiency on mathematical reasoning tasks, with a reported 13.34-point gain in AIME 2025 pass@32.

Contribution

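The paper proposes two lightweight plug-in modules: Accuracy-Conditioned KL Scaling (AKL), which adjusts the KL regularization strength dynamically based on batch average accuracy, and Gaussian Curriculum Sampling (GCS), which focuses training on the most informative, moderately difficult questions. The authors report that EXPO outperforms existing methods.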

Findings

1. EXPO achieves a 13.34-point gain on AIME 2025 pass@32, from 63.33% to 76.67%.

2. EXPO improves average pass@32 by 2.66 points on 8B models.

3. The larger gains on pass@32 suggest that EXPO enlarges the model's exploration boundary.
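
The findings above are reported as pass@32. As a reminder of what that metric measures, here is a minimal sketch of the commonly used unbiased pass@k estimator, computed from n sampled answers per question of which c are correct. The page does not say how the authors computed pass@32, so treating it as this estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Uses the standard combinatorial estimator 1 - C(n - c, k) / C(n, k).
    Assumption: the paper's exact evaluation protocol is not given on this
    page, so this is only one common way to compute the metric.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are wrong, every size-k subset contains a hit.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 32, the estimate for a question is 1.0 if any of the 32 samples is correct and 0.0 otherwise, which is why pass@32 is often read as a measure of how far exploration reaches rather than of average accuracy.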

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, where Group Relative Policy Optimization (GRPO) serves as the mainstream algorithm. We identify two understudied inefficiencies in GRPO. First, the fixed KL penalty coefficient overly restricts policy exploration at stages where the model requires significant deviation from the reference policy. Second, uniform sampling of training questions ignores that moderately difficult problems provide the most informative gradient signals for optimization. We propose Exploration-Prioritized Policy Optimization (EXPO) with two lightweight plug-in modules. The Accuracy-Conditioned KL Scaling (AKL) dynamically adjusts KL regularization strength through a smooth nonlinear function of batch average accuracy, relaxing the penalty when the model underperforms and strengthening…
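
The two ideas described in the abstract can be illustrated with a rough sketch. It is not the authors' implementation: the abstract is truncated here and gives no formulas, so the sigmoid schedule, the Gaussian centered on a 0.5 pass rate, and every name and default value (`akl_coefficient`, `gaussian_weights`, `sample_questions`, `beta_min`, `beta_max`, `midpoint`, `sharpness`, `mu`, `sigma`) are assumptions chosen only for illustration.

```python
import math
import random


def akl_coefficient(batch_accuracy, beta_min=0.001, beta_max=0.04,
                    midpoint=0.5, sharpness=10.0):
    """Hypothetical smooth KL schedule driven by batch average accuracy.

    Assumption: a sigmoid gate, so the KL weight is small when accuracy is
    low (penalty relaxed) and approaches beta_max as accuracy rises.
    """
    if not 0.0 <= batch_accuracy <= 1.0:
        raise ValueError("batch_accuracy must lie in [0, 1]")
    gate = 1.0 / (1.0 + math.exp(-sharpness * (batch_accuracy - midpoint)))
    return beta_min + (beta_max - beta_min) * gate


def gaussian_weights(pass_rates, mu=0.5, sigma=0.2):
    """Unnormalized Gaussian weight per question, peaked at pass rate mu.

    Assumption: question difficulty is summarized by an empirical pass rate,
    and "moderately difficult" means a pass rate near mu.
    """
    return [math.exp(-((p - mu) ** 2) / (2.0 * sigma ** 2)) for p in pass_rates]


def sample_questions(pass_rates, batch_size, rng, mu=0.5, sigma=0.2):
    """Draw question indices (with replacement) in proportion to the weights."""
    weights = gaussian_weights(pass_rates, mu, sigma)
    return rng.choices(range(len(pass_rates)), weights=weights, k=batch_size)


# Example: questions with pass rates near 0.5 are drawn most often.
rng = random.Random(0)
batch = sample_questions([0.0, 0.25, 0.5, 0.75, 1.0], batch_size=8, rng=rng)
beta = akl_coefficient(batch_accuracy=0.3)  # weaker KL penalty at low accuracy
```

In a GRPO-style loop, `beta` would scale the KL term of the loss for the current batch, and `sample_questions` would replace uniform sampling of training questions. How the paper actually parameterizes either module is not stated on this page.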
