CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency

Ehsan Aghazadeh; Ahmad Ghasemi; Hedyeh Beyhaghi; Hossein Pishro-Nik

arXiv:2511.02603·cs.CL·November 5, 2025

TL;DR

CGES is a Bayesian method that adaptively stops querying large language models during reasoning, significantly reducing calls while maintaining high accuracy by using confidence signals to determine when enough evidence has been gathered.

Contribution

It introduces a novel confidence-guided early stopping framework for LLMs that adaptively halts sampling based on posterior confidence, improving efficiency without sacrificing accuracy.

Findings

01. Reduces model calls by about 69% on average.

02. Maintains accuracy within 0.06 percentage points of full sampling.

03. Provides theoretical guarantees under both perfectly calibrated and noisy confidence signals.

Abstract

Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls and can fail when the correct answer is rare. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models. CGES adaptively halts sampling once the posterior mass of a candidate exceeds a threshold. We provide theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals. Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69 percent (for example, from 16.0 to 4.9) while matching the accuracy of self-consistency within 0.06 percentage points.
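The stopping rule described in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the likelihood model, so the sketch assumes a uniform prior over `num_candidates` possible answers and a symmetric-noise likelihood in which a response with confidence `c` supports its own answer with probability `c` and spreads `1 - c` evenly over the remaining candidates. The names `cges_sketch`, `sample_fn`, `num_candidates`, `threshold`, and `max_calls` are hypothetical.

```python
import math
from collections import defaultdict


def cges_sketch(sample_fn, num_candidates, threshold=0.95, max_calls=16):
    """Confidence-guided early stopping, as a minimal sketch.

    sample_fn() is a hypothetical callable returning (answer, confidence),
    with confidence in (0, 1). Assumed model (not taken from the paper):
    uniform prior over num_candidates answers, and each sample supports its
    own answer with probability c and every other answer with (1 - c) / (K - 1).
    Returns (best_answer, calls_used, posterior_of_best).
    """
    if num_candidates < 2 or max_calls < 1:
        raise ValueError("need at least 2 candidates and 1 call")

    # Log-likelihood ratio of each seen answer relative to an unseen one.
    log_odds = defaultdict(float)
    best, best_post = None, 0.0

    for calls in range(1, max_calls + 1):
        answer, conf = sample_fn()
        conf = min(max(conf, 1e-6), 1.0 - 1e-6)  # avoid log(0)
        log_odds[answer] += math.log(conf) - math.log(
            (1.0 - conf) / (num_candidates - 1)
        )

        # Normalize in the log domain; unseen candidates have log-odds 0.
        unseen = max(0, num_candidates - len(log_odds))
        m = max(max(log_odds.values()), 0.0)
        z = sum(math.exp(v - m) for v in log_odds.values()) + unseen * math.exp(-m)
        posterior = {a: math.exp(v - m) / z for a, v in log_odds.items()}

        best = max(posterior, key=posterior.get)
        best_post = posterior[best]
        if best_post >= threshold:
            return best, calls, best_post  # enough evidence: stop early

    # Budget exhausted: fall back to the highest-posterior answer.
    return best, max_calls, best_post


# Usage with a fixed, made-up stream of (answer, confidence) pairs.
stream = iter([("42", 0.9), ("42", 0.9), ("7", 0.6), ("42", 0.8)])
answer, calls, post = cges_sketch(lambda: next(stream), num_candidates=4)
print(answer, calls, round(post, 4))
```

With four candidates, one response at confidence 0.9 gives a posterior of 0.9 for its answer, which is below the 0.95 threshold; a second agreeing response pushes it past the threshold, so the loop stops after two calls instead of using the full budget. Note that `num_candidates` must be supplied up front, which is the practical constraint Reviewer 01 raises below.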

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 3

Strengths

The paper is very well written and does an excellent job explaining the concepts in an accurate and understandable language. I really enjoyed the formal statements for the assumption and the graphical model. The idea to use both the output and an auxiliary signal is novel, and allows the method to utilize the progress in uncertainty quantification of LLMs in the future. The experiments are diverse and extensive enough for confident evaluations. The method performs very well in terms of cost.

Weaknesses

I think the main weakness of the method is that we need to know at least a tight upper bound on the number of final answers. It works well for multiple-choice questions, but I expect it to be more challenging for short-answer questions where no predefined set of answers is known. I suspect that using a loose upper bound can drastically hinder the efficiency gains and force the algorithm to wait for non-existent answers to appear. I could not find the methodology used for the MATH dataset an

Reviewer 02 · Rating 2 · Confidence 3

Strengths

I find the overall idea of using confidences of LLM responses to inform test-time scaling useful. This is of course not a new idea by any means, but this paper proposes a scoring approach that uses a particular Bayesian formulation that effectively aggregates confidences from samples. The idea of minimizing the number of samples for test-time scaling is also useful.

Weaknesses

A major weakness of the paper is the problematic formulation. I have several issues about the assumptions of the paper and find them to be problematic. The paper gives the air of theoretical basis/justification for the CGES approach, but from my understanding, some of the assumptions do not make sense and the formulation feels flawed. I will describe some concerns in my detailed comments. Another weakness is the lack of sufficient literature around confidence estimation in general. Confidences

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Test-time scaling and self-consistency for LLMs are topics that have received considerable interest recently, and therefore advances in this area are definitely warranted.
- The paper proposes a simple yet efficient self-consistency scheme that leverages a combination of confidence estimation and Bayesian probabilistic inference to rank and generate fewer candidate responses compared to existing self-consistency methods.

Weaknesses

- I was not able to find any major weaknesses with the paper.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)