Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

William Jurayj; Jeffrey Cheng; Benjamin Van Durme

arXiv:2502.13962·cs.CL·July 21, 2025

TL;DR

This paper shows that increasing test-time compute for large language models improves both answer accuracy and confidence in correct answers, and it introduces a framework for evaluating models at varying levels of response risk.

Contribution

It demonstrates that test-time scaling improves both correctness and confidence, and proposes a new evaluation paradigm considering non-zero response risk levels.

Findings

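01 Increasing compute at inference time helps models answer more questions correctly.

02 Increasing compute also raises confidence in correct responses, so confidence scores can be used to threshold answers.

03 The paper proposes a recipe for evaluating and reporting model performance at non-zero levels of response risk.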

Abstract

Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
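The thresholding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Prediction` type, the `selective_score` and `coverage` helpers, and the `wrong_penalty` parameter are hypothetical names introduced here, and the scoring rule (+1 for a correct answer, 0 for abstaining, minus a penalty for a wrong answer) is one assumed way to model "response risk".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Prediction:
    """One model response (hypothetical container for this sketch)."""
    confidence: float  # confidence score extracted for the answer, assumed in [0, 1]
    correct: bool      # whether the answer matches the reference


def coverage(preds: Sequence[Prediction], threshold: float) -> float:
    """Fraction of questions the model answers at this confidence threshold."""
    if not preds:
        raise ValueError("preds must be non-empty")
    return sum(p.confidence >= threshold for p in preds) / len(preds)


def selective_score(
    preds: Sequence[Prediction], threshold: float, wrong_penalty: float
) -> float:
    """Mean utility when the model answers only if confidence >= threshold.

    Assumed scoring: +1 for a correct answer, 0 for abstaining,
    -wrong_penalty for a wrong answer. wrong_penalty = 0 corresponds to the
    usual setting where a wrong answer costs nothing.
    """
    if not preds:
        raise ValueError("preds must be non-empty")
    total = 0.0
    for p in preds:
        if p.confidence < threshold:
            continue  # abstain: contributes 0 to the total
        total += 1.0 if p.correct else -wrong_penalty
    return total / len(preds)


# Toy data, invented purely for illustration.
example = [
    Prediction(0.9, True),
    Prediction(0.8, False),
    Prediction(0.4, True),
    Prediction(0.2, False),
]
always_answer = selective_score(example, threshold=0.0, wrong_penalty=1.0)
selective = selective_score(example, threshold=0.85, wrong_penalty=1.0)
```

With a non-zero penalty, answering every question is no longer automatically the best policy: on the toy data above, abstaining on low-confidence questions yields a higher score than always answering. Sweeping `threshold` and reporting the score alongside `coverage` is one way to compare models under a chosen risk level.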

Code & Models

Repositories

wjurayj/final_answer
PyTorch · Official

Videos

Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

Taxonomy

Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Educational Strategies and Epistemologies