Rethinking Fine-Tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning

Feng Chen; Allan Raventos; Nan Cheng; Surya Ganguli; Shaul Druckmann

arXiv:2502.07154 · cs.LG · November 26, 2025

TL;DR

This paper investigates how limiting model confidence during training can improve large language models' mathematical reasoning performance when using test-time compute strategies like pass@N, revealing the importance of co-designing training and inference.

Contribution

It introduces a modified training loss that reduces overconfidence, aligning training with pass@N, and demonstrates improved reasoning performance on math benchmarks.

Findings

01. Overconfidence from cross-entropy loss impairs pass@N accuracy.

02. Limiting confidence during training enhances mathematical reasoning.

03. Modified loss improves performance on MATH and MiniF2F benchmarks.

Abstract

Recent progress in large language models (LLMs) highlights the power of scaling test-time compute to achieve strong performance on complex tasks, such as mathematical reasoning and code generation. This raises a critical question: how should model training be modified to optimize performance under a subsequent test-time compute strategy and budget? To explore this, we focus on pass@N, a simple test-time strategy that searches for a correct answer in $N$ independent samples. We show, surprisingly, that training with cross-entropy (CE) loss can be misaligned with pass@N in that pass@N accuracy decreases with longer training. We explain the origins of this misalignment in terms of model overconfidence induced by CE, and experimentally verify our prediction of overconfidence as an impediment to scaling test-time compute via pass@N. Furthermore we suggest a principled,…
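As a rough illustration of the pass@N strategy described in the abstract, the sketch below computes the chance that at least one of N independent samples is correct when each sample succeeds with probability p, which is 1 - (1 - p)^N. The function names and the two sets of per-problem probabilities are hypothetical, chosen only to show how concentrating all probability on a single answer can cap pass@N; they are not taken from the paper or its repository.

```python
from typing import Sequence


def pass_at_n(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct,
    given that each sample is correct with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1.0 - (1.0 - p) ** n


def mean_pass_at_n(ps: Sequence[float], n: int) -> float:
    """Average pass@N over a set of problems with per-sample success rates ps."""
    return sum(pass_at_n(p, n) for p in ps) / len(ps)


# Hypothetical per-problem success probabilities (illustrative only).
# The "confident" model puts all its mass on one answer per problem and is
# right on half of the problems; the "hedged" model spreads its mass and
# samples a correct answer half of the time on every problem.
confident = [1.0, 1.0, 0.0, 0.0]
hedged = [0.5, 0.5, 0.5, 0.5]

for n in (1, 8):
    print(n, mean_pass_at_n(confident, n), mean_pass_at_n(hedged, n))
```

Under these made-up numbers both models score the same at N = 1, but only the hedged model benefits from extra samples, which is the kind of effect the abstract attributes to overconfidence.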

Code & Models

Repositories

allanraventos/refine (PyTorch, official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Computability, Logic, AI Algorithms · Neural Networks and Applications

Methods: Focus