PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning
Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Jiansheng Wei, Xiaojun Meng, Min Zhang

TL;DR
This paper challenges the belief that more aggressive exploration improves iterative alignment in mathematical reasoning, showing that a generation-based corrective approach with a minimal exploration budget can outperform Best-of-N mining baselines such as DPO-R1.
Contribution
The paper introduces PACE, a novel alignment method that replaces brute-force mining with a generation-based strategy, achieving better performance with less compute and increased robustness.
Findings
PACE outperforms DPO-R1 with fewer resources
Aggressive exploration can cause policy collapse in mathematical reasoning
PACE is more robust against reward hacking and label noise
Abstract
Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., ) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce PACE (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget (), PACE synthesizes high-fidelity preference pairs from failed explorations.…
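The Best-of-N mining step that the abstract attributes to standard implementations can be sketched as follows. This is a minimal illustration, not the authors' code: `mine_preference_pair`, `sample`, `verify`, and `false_accept_probability` are hypothetical names, and the noise formula assumes each incorrect sample is independently accepted by the verifier with probability `eps`, which is a simplification rather than the paper's analysis.

```python
from typing import Callable, Optional, Tuple


def mine_preference_pair(
    prompt: str,
    sample: Callable[[str], str],        # hypothetical: draws one response from the policy
    verify: Callable[[str, str], bool],  # hypothetical: verifier verdict for (prompt, response)
    n: int,
) -> Optional[Tuple[str, str]]:
    """Draw n candidates and return (chosen, rejected), or None if no contrast exists."""
    candidates = [sample(prompt) for _ in range(n)]
    verdicts = [verify(prompt, c) for c in candidates]
    passed = [c for c, ok in zip(candidates, verdicts) if ok]
    failed = [c for c, ok in zip(candidates, verdicts) if not ok]
    if not passed or not failed:
        # All candidates agree, so there is no preference signal to train on.
        return None
    return passed[0], failed[0]


def false_accept_probability(eps: float, n: int) -> float:
    """Chance that at least one of n incorrect samples is wrongly accepted,
    assuming independent false accepts at rate eps (an illustrative assumption)."""
    return 1.0 - (1.0 - eps) ** n
```

Under the independence assumption, the chance of mining a wrongly accepted "golden" trajectory grows with N, which gives some intuition for why a larger sampling budget can amplify verifier noise.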
Taxonomy
Topics: Constraint Satisfaction and Optimization · Machine Learning and Data Classification · Topic Modeling
