Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning

Yongyu Mu; Jiali Zeng; Fandong Meng; JingBo Zhu; Tong Xiao

arXiv:2603.16206 · cs.LG · March 18, 2026



TL;DR

This paper introduces Offline eXploration-Aware (OXA) fine-tuning, a supervised fine-tuning method that uses model confidence to decide which distillation data to promote and which to suppress, improving both the exploration and the mathematical reasoning performance of large language models.

Contribution

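The paper proposes OXA, a fine-tuning approach with two objectives: promoting low-confidence verified teacher-distillation data, so the model internalizes reasoning patterns it has not yet captured, and suppressing high-confidence incorrect self-distillation data, so probability mass moves away from incorrect patterns. The reported result is a significant improvement in reasoning performance over traditional methods.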

Findings

01

OXA improves Pass@1 by +6 and Pass@$k$ by +5 on average.

02

OXA increases initial policy entropy, fostering better exploration.

03

Performance gains from OXA persist during extensive RLVR training.
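
The two quantities named in the findings, Pass@$k$ and policy entropy, can be computed as in the sketch below. It assumes the widely used unbiased Pass@$k$ estimator, 1 - C(n-c, k) / C(n, k), and the mean Shannon entropy of next-token distributions; the page does not say which estimator or entropy definition the paper uses, and all function names here are hypothetical.

```python
import math
from typing import List, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which are correct.

    Assumption: this is the standard estimator 1 - C(n-c, k) / C(n, k);
    the paper's exact evaluation protocol is not stated on this page.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(distributions: List[Sequence[float]]) -> float:
    """Average per-token entropy over a sequence of next-token distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)
```

With `k = 1` the estimator reduces to the fraction of correct samples, `c / n`. A higher `mean_policy_entropy` at the start of RLVR means the policy spreads probability over more candidate tokens, which is the sense of "exploration" in the second finding.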

Abstract

Through encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. As the starting point for RLVR, the capacity of supervised fine-tuning (SFT) to memorize new chain-of-thought trajectories provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving exploration-aware SFT under-explored. To bridge this gap, we propose Offline eXploration-Aware (OXA) fine-tuning. Specifically, OXA optimizes two objectives: promoting low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppressing high-confidence incorrect self-distillation data to redistribute probability mass of incorrect patterns toward…
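
The two objectives described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract is truncated and does not give the loss, so the confidence measure (geometric-mean token probability), the thresholds `low` and `high`, the unlikelihood-style suppression term, and the weight `beta` are all assumptions chosen for illustration, and every name below is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    token_probs: List[float]  # model probability of each target token, each in (0, 1]
    is_correct: bool          # outcome of the answer verifier
    source: str               # "teacher" (teacher-distilled) or "self" (self-distilled)


def confidence(t: Trajectory) -> float:
    """Assumed confidence measure: geometric mean of token probabilities."""
    return math.exp(sum(math.log(p) for p in t.token_probs) / len(t.token_probs))


def select(trajs: List[Trajectory], low: float, high: float
           ) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Split data into the two sets named in the abstract."""
    # Promote: verified teacher trajectories the model is not yet confident in.
    promote = [t for t in trajs
               if t.source == "teacher" and t.is_correct and confidence(t) < low]
    # Suppress: incorrect self-generated trajectories the model is confident in.
    suppress = [t for t in trajs
                if t.source == "self" and not t.is_correct and confidence(t) > high]
    return promote, suppress


def nll(t: Trajectory) -> float:
    """Mean negative log-likelihood; minimizing it raises token probabilities."""
    return -sum(math.log(p) for p in t.token_probs) / len(t.token_probs)


def unlikelihood(t: Trajectory, eps: float = 1e-6) -> float:
    """Mean of -log(1 - p); minimizing it lowers token probabilities (assumed form)."""
    return -sum(math.log(max(1.0 - p, eps)) for p in t.token_probs) / len(t.token_probs)


def oxa_style_loss(trajs: List[Trajectory], low: float = 0.5,
                   high: float = 0.8, beta: float = 0.1) -> float:
    """Promotion loss plus a beta-weighted suppression loss (illustrative only)."""
    promote, suppress = select(trajs, low, high)
    loss_p = sum(nll(t) for t in promote) / len(promote) if promote else 0.0
    loss_s = sum(unlikelihood(t) for t in suppress) / len(suppress) if suppress else 0.0
    return loss_p + beta * loss_s
```

In a real training loop the token probabilities would come from the model being fine-tuned and the loss would be differentiated with an autodiff framework; plain floats are used here only to keep the selection logic easy to read.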


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Materials Science