Query-Level Uncertainty in Large Language Models
Lihu Chen, Gerard de Melo, Fabian M. Suchanek, Gaël Varoquaux

TL;DR
This paper introduces Internal Confidence, a training-free method that estimates, before any tokens are generated, whether a large language model can answer a given query, enabling more efficient and trustworthy adaptive inference.
Contribution
Proposes a novel, training-free method for query-level uncertainty estimation in LLMs that yields higher-quality confidence estimates than baselines and reduces inference costs in adaptive settings.
Findings
Internal Confidence outperforms baselines in confidence quality
Reduces inference costs in retrieval-augmented and cascading models
Maintains performance while improving efficiency
Abstract
It is important for Large Language Models (LLMs) to be aware of the boundary of their knowledge, distinguishing queries they can confidently answer from those that lie beyond their capabilities. Such awareness enables models to perform adaptive inference, such as invoking retrieval-augmented generation (RAG), engaging in slow and deep thinking, or abstaining from answering when appropriate. These mechanisms are key to developing efficient and trustworthy AI. In this work, we propose a method to detect knowledge boundaries via Query-Level Uncertainty, which estimates if a model is capable of answering a given query before generating any tokens, thus avoiding the generation cost. To this end, we propose a novel, training-free method called Internal Confidence, which leverages self-evaluations across layers and tokens to provide a reliable signal of uncertainty. Empirical studies on both…
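The adaptive inference described in the abstract can be pictured as a small routing step driven by a pre-generation confidence score. The following is a minimal sketch, not the authors' implementation: the function name `route_query`, the two thresholds, and the mapping of confidence bands to actions are all assumptions made for illustration.

```python
def route_query(confidence, answer_threshold=0.7, abstain_threshold=0.2):
    """Pick an inference strategy from a query-level confidence in [0, 1].

    Hypothetical policy (thresholds are illustrative, not from the paper):
      - high confidence   -> answer directly, skipping retrieval
      - low confidence    -> abstain
      - anything between  -> fall back to retrieval-augmented generation
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= answer_threshold:
        return "answer_directly"
    if confidence < abstain_threshold:
        return "abstain"
    return "retrieve_then_answer"


if __name__ == "__main__":
    for score in (0.9, 0.5, 0.1):
        print(score, "->", route_query(score))
```

Because the score is available before generation, the expensive path (retrieval, slow thinking, or a larger model) is only taken for queries that fall below the answer threshold; in practice the thresholds would be tuned on held-out data.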
Peer Reviews
Decision: ICLR 2026 Poster
Pre-generation, single-pass signal. IC avoids generating answers and extra prompts; it’s computed from internal states with one forward pass, which is appealing for latency/cost. Center-weighted ensembling across tokens×layers. The “decision center” idea is well-motivated by heatmaps showing the best separator isn’t always exactly the last position; the attenuated average reduces variance while keeping locality. Practical routing use-cases. Clear demonstrations for RAG triggering and small→lar
1) Baselines. It is puzzling that the baselines differ between Tables 1 and 2; they should be consistent to allow a fair comparison of self-knowledge and efficiency across models and tasks. Moreover, the state-of-the-art claim cannot be substantiated without including a broader set of strong baselines. Following the recent TACL benchmark on uncertainty quantification [1], at minimum the evaluation should cover half of these baselines (that outperform SAR as a top baseline from the paper): CCP,
* Framing “knowledge boundary” as pre-answer uncertainty is useful for agentic pipelines (RAG, slow-thinking, model cascades).
* Training-free and fast. Single forward pass speed largely independent of answer length, high leverage for long-form tasks and tool-heavy agents.
* IC uses only standard hidden states + unembedding, attenuation over layers and tokens captures the information where the “can I answer?” signal concentrates.
* Fixed decision center and a single locality hyper-parameter yiel
1. Operational definition of “knowledge boundary.” The binary label is tied to greedy decoding success. This could confuse capability with decoding heuristics and underestimate the success of non-greedy decoding. Sensitivity analysis of decoding strategies would help strengthen these claims.
2. IC requires hidden states and full model access. Many production black-box APIs don’t expose them.
3. The method fixes the decision center (last-layer, last-token) and selects a single decay schedule. A m
Query-level uncertainty is a genuinely different and practical perspective from existing answer-level methods. Rather than generating long answers to assess uncertainty, the method predicts answerability before token generation, directly addressing efficiency bottlenecks in real-world systems. The Internal Confidence method is elegantly simple—it aggregates P(YES) signals across layers and tokens with attenuated encoding weights, producing calibrated uncertainty in a single forward pass without
The theoretical justification is weak. While P(YES) seems intuitive as a proxy for query answerability, the connection to actual answering capability is assumed rather than proven. Similarly, attenuated encoding is presented as effective, but other aggregation schemes aren't systematically compared to justify this choice. The "decision center" concept, which is central to the method, lacks theoretical grounding—the paper doesn't explain why the top-right position is optimal or how it should adap
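The reviews above describe Internal Confidence as a center-weighted average of P(YES) readings over a tokens × layers grid, anchored at a decision center (last layer, last token) with a single locality hyper-parameter. The sketch below illustrates that idea under stated assumptions: the exponential decay shape, the separable layer and token weights, the function names, and the precomputed `p_yes` grid are all hypothetical and are not taken from the paper.

```python
import math


def attenuated_weights(n, center, locality):
    """Normalized weights that decay with distance from `center`.

    Assumption: exponential decay exp(-|i - center| / locality); the paper's
    actual attenuation schedule may differ.
    """
    raw = [math.exp(-abs(i - center) / locality) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def internal_confidence_sketch(p_yes, locality=1.0):
    """Aggregate a grid p_yes[layer][token] of P(YES) values into one score.

    The decision center is fixed at the last layer and last token, as the
    reviews describe. How P(YES) is read out of each hidden state is outside
    this sketch; the grid is assumed to be given.
    """
    n_layers, n_tokens = len(p_yes), len(p_yes[0])
    layer_w = attenuated_weights(n_layers, n_layers - 1, locality)
    token_w = attenuated_weights(n_tokens, n_tokens - 1, locality)
    return sum(
        layer_w[l] * token_w[t] * p_yes[l][t]
        for l in range(n_layers)
        for t in range(n_tokens)
    )


if __name__ == "__main__":
    # Toy grid: rows are layers (shallow to deep), columns are query tokens.
    grid = [
        [0.2, 0.3, 0.4],
        [0.3, 0.5, 0.6],
        [0.4, 0.7, 0.9],
    ]
    print(internal_confidence_sketch(grid, locality=1.0))
```

Since the weights are non-negative and sum to one, the score stays within the range of the grid values. A small `locality` concentrates the weight on the decision center, while a large one approaches a plain average over all positions.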
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Machine Learning in Materials Science
