How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors
Yifan Xu, Junren Chen, Yifan Chen

TL;DR
This paper introduces IMAX, a framework that improves exploration in RLVR by training soft prefixes that diversify reasoning trajectories. It improves Pass@4 by up to 11.60% and outperforms standard RLVR across three backbone scales.
Contribution
IMAX is a model-agnostic approach to exploration in RLVR that combines trainable soft prefixes with an information-maximization reward, and it outperforms standard RLVR.
Findings
IMAX achieves up to 11.60% improvement in Pass@4.
IMAX consistently outperforms standard RLVR across three backbone scales.
The framework is compatible with existing RLVR pipelines.
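Pass@k, the metric behind the first finding, can be illustrated with a short sketch. This summary does not say how the paper computes Pass@4, so the code below is an assumption: it uses the commonly used unbiased estimator, which takes n sampled rollouts per problem, counts the c correct ones, and estimates the probability that at least one of k randomly drawn rollouts is correct. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled rollouts, c of which are correct.

    Assumption: this is the widely used unbiased estimator
    1 - C(n - c, k) / C(n, k); the summary does not confirm that
    the paper computes Pass@4 this way.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct rollout.
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


# Example: 8 rollouts for one problem, 2 of them verified correct.
estimate = pass_at_k(n=8, c=2, k=4)
```

A benchmark-level score would average this per-problem estimate over all problems. Higher Pass@k at fixed single-rollout accuracy indicates broader coverage of successful trajectories, which is the quantity exploration methods aim to raise.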
Abstract
Reinforcement learning with verifiable rewards (RLVR) has recently thrived in large language model (LLM) reasoning tasks. However, reward sparsity and long reasoning horizons make effective exploration challenging. In practice, this challenge manifests as the entropy collapse phenomenon, where RLVR improves single-rollout accuracy but fails to expand coverage of successful reasoning trajectories. Passive exploration techniques such as entropy regularization tend to neglect generation quality, resulting in noisy rollouts. In response, we propose an Information-Maximizing Augmented eXploration (IMAX) framework that trains a pool of soft prefixes to reshape the base model's prior over reasoning trajectories. Rather than relying on RL to incentivize exploration on top of the base model, each prefix acts as a trainable control knob that induces a distinct rollout…
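The general mechanics of a soft-prefix pool can be sketched as follows. This is a minimal illustration of generic prefix tuning, not IMAX itself: the excerpt does not give the prefix parameterization, the pool size, or the information-maximization reward, so the names `make_prefix_pool` and `prepend_prefix`, the dimensions, and the Gaussian initialization are all assumptions. Each prefix is a small matrix of trainable vectors placed in front of the prompt's token embeddings, so the same prompt yields a different conditioned input per prefix.

```python
import random


def make_prefix_pool(num_prefixes, prefix_len, d_model, seed=0):
    """Create a pool of soft prefixes (hypothetical helper).

    Each prefix is a prefix_len x d_model matrix of floats that would
    be the trainable parameters; here they are only randomly initialized.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(prefix_len)]
        for _ in range(num_prefixes)
    ]


def prepend_prefix(prefix, token_embeddings):
    """Concatenate a soft prefix and prompt embeddings along the sequence axis."""
    return prefix + token_embeddings


# Assumed toy sizes: 4 prefixes, 3 prefix positions, embedding width 8.
pool = make_prefix_pool(num_prefixes=4, prefix_len=3, d_model=8)

# Stand-in for the embedded prompt: 5 tokens of width 8.
prompt_embeddings = [[0.0] * 8 for _ in range(5)]

# One conditioned input per prefix; a frozen base model would generate
# one rollout from each, giving rollouts that differ by prefix.
conditioned_inputs = [prepend_prefix(p, prompt_embeddings) for p in pool]
```

In an actual pipeline the prefix entries would be optimized while the base model generates from each conditioned input. How IMAX trains them is beyond what this excerpt states.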
