PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning

Jun Rao; Zixiong Yu; Xuebo Liu; Guhan Chen; Jing Li; Jiansheng Wei; Xiaojun Meng; Min Zhang

arXiv:2602.05370·cs.CL·February 9, 2026

Open Access

TL;DR

This paper challenges the belief that more aggressive exploration improves iterative alignment in mathematical reasoning, showing that a generation-based corrective approach with a minimal exploration budget can outperform standard Best-of-N mining (DPO-R1).

Contribution

The paper introduces PACE, an alignment method that replaces brute-force mining with a generation-based corrective strategy, achieving better performance with less compute while also being more robust.

Findings

01. PACE outperforms DPO-R1 with fewer resources

02. Aggressive exploration can cause policy collapse in mathematical reasoning

03. PACE is more robust against reward hacking and label noise

Abstract

Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., $N \geq 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce PACE (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget ($2 < N < 3$), PACE synthesizes high-fidelity preference pairs from failed explorations.…
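As a rough intuition for the claim that scaling $N$ amplifies verifier noise, the toy calculation below may help. It is an illustrative sketch only, not the paper's theoretical analysis: it assumes i.i.d. samples, a per-sample probability `p_correct` that a trajectory is right, and a verifier that wrongly accepts an incorrect trajectory with probability `fp_rate`. All of these names and the numeric values are hypothetical.

```python
def p_any_false_positive(n: int, p_correct: float, fp_rate: float) -> float:
    """Probability that at least one of n i.i.d. samples is incorrect
    yet accepted by a noisy verifier (toy model, assumed for illustration)."""
    # A single sample is a false positive when it is wrong AND the verifier accepts it.
    per_sample = (1.0 - p_correct) * fp_rate
    # Complement of "no false positive in any of the n samples".
    return 1.0 - (1.0 - per_sample) ** n


if __name__ == "__main__":
    # Hypothetical hard prompt: 20% of samples are right, 5% verifier false-positive rate.
    for n in (1, 2, 4, 8, 16, 32):
        print(n, round(p_any_false_positive(n, p_correct=0.2, fp_rate=0.05), 4))
```

Under these assumptions, the chance that a Best-of-N pool contains a wrongly accepted trajectory grows monotonically with $N$, which is one simple way a larger sampling budget can expose training to more verifier errors.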

Taxonomy

Topics: Constraint Satisfaction and Optimization · Machine Learning and Data Classification · Topic Modeling