Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
William Jurayj, Jeffrey Cheng, Benjamin Van Durme

TL;DR
This paper shows that increasing test-time compute for large language models improves both answer accuracy and confidence in correct answers, and it introduces a way to evaluate models under non-zero levels of response risk.
Contribution
The paper demonstrates that test-time scaling improves both answer correctness and confidence in correct responses, and it proposes a recipe for reporting evaluations in settings with non-zero response risk.
Findings
Increased compute at inference improves answer correctness.
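A larger inference compute budget also increases the model's confidence in its correct responses.
Evaluation is extended beyond zero-risk responses to settings with non-zero response risk, with a suggested recipe for reporting results.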
Abstract
Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
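The thresholding idea in the abstract can be illustrated with a minimal sketch. It assumes each response comes with a confidence score in [0, 1], that the system abstains when the score falls below a threshold, and that a wrong answer costs a configurable penalty while an abstention costs nothing. The names `Prediction`, `selective_answer`, `risk_adjusted_score`, and `wrong_penalty`, along with the example data and penalty values, are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    answer: str        # the model's proposed answer
    confidence: float  # assumed to lie in [0, 1]
    gold: str          # the reference answer


def selective_answer(pred: Prediction, threshold: float) -> Optional[str]:
    """Return the answer if confidence meets the threshold, else abstain (None)."""
    return pred.answer if pred.confidence >= threshold else None


def risk_adjusted_score(
    preds: List[Prediction], threshold: float, wrong_penalty: float
) -> float:
    """Mean utility: +1 for a correct answer, 0 for abstaining,
    -wrong_penalty for an incorrect answer (an illustrative assumption)."""
    if not preds:
        raise ValueError("preds must be non-empty")
    total = 0.0
    for p in preds:
        out = selective_answer(p, threshold)
        if out is None:
            continue  # abstention contributes nothing
        total += 1.0 if out == p.gold else -wrong_penalty
    return total / len(preds)


# Hypothetical toy data: three correct answers and one confident-ish error.
examples = [
    Prediction("4", 0.95, "4"),
    Prediction("7", 0.60, "9"),
    Prediction("12", 0.80, "12"),
    Prediction("3", 0.40, "3"),
]

always_answer = risk_adjusted_score(examples, threshold=0.0, wrong_penalty=2.0)
selective = risk_adjusted_score(examples, threshold=0.7, wrong_penalty=2.0)
no_risk = risk_adjusted_score(examples, threshold=0.0, wrong_penalty=0.0)
```

With a penalty of zero the score reduces to plain accuracy, which corresponds to the always-answer assumption the abstract questions. Once wrong answers carry a cost, abstaining on low-confidence responses can raise the score even though fewer questions are answered.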
Taxonomy
Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Educational Strategies and Epistemologies
