Do LLMs estimate uncertainty well in instruction-following?
Juyeon Heo, Miao Xiong, Christina Heinze-Deml, Jaya Narain

TL;DR
This paper systematically evaluates how well large language models estimate their uncertainty when following instructions, revealing significant limitations and the need for improved methods for trustworthy AI deployment.
Contribution
It introduces a controlled evaluation framework for uncertainty estimation in instruction-following LLMs and provides the first comprehensive analysis of their capabilities and shortcomings.
Findings
Existing uncertainty methods struggle with subtle errors.
Internal model states offer limited improvements.
Evaluation setup reveals key limitations in current approaches.
Abstract
Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs' instruction-following capabilities, raising concerns about their reliability in high-stakes applications. Accurately estimating LLMs' uncertainty in adhering to instructions is critical to mitigating deployment risks. We present, to our knowledge, the first systematic evaluation of the uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies key challenges with existing instruction-following benchmarks, where multiple factors are entangled with the uncertainty stemming from instruction-following, which complicates isolating that uncertainty and comparing methods and models. To address these issues, we introduce a controlled evaluation setup with two…
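To make the idea of evaluating an uncertainty estimate concrete, here is a minimal sketch. It is not the authors' code. It assumes that each response comes with a scalar confidence score from some uncertainty method and a binary label saying whether the instruction was followed, and it scores the method with AUROC, a metric commonly used for this kind of evaluation. The function name `auroc` and the toy data are hypothetical.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], followed: Sequence[bool]) -> float:
    """Probability that a randomly chosen success receives a higher
    confidence than a randomly chosen failure (ties count as 0.5)."""
    pos = [c for c, ok in zip(confidences, followed) if ok]
    neg = [c for c, ok in zip(confidences, followed) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    # Compare every (success, failure) pair of confidence scores.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six responses, their confidence scores, and
# whether each response actually followed the instruction.
confidences = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
followed = [True, True, False, True, False, False]
score = auroc(confidences, followed)
print(f"AUROC = {score:.3f}")
```

A score near 1.0 means the confidence scores separate followed from unfollowed instructions well, while 0.5 is what a method would get if its scores carried no information about success.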
Peer Reviews
Decision · ICLR 2025 Poster
1. The authors provide a systematic evaluation, identifying multiple factors that affect uncertainty estimation and are entangled within naturally generated responses, which helps isolate the influence of various factors and allows for a more accurate assessment of estimation methods' capabilities.
2. The study offers thorough comparisons between different uncertainty estimation methods, various LLMs, and distinct datasets.
1. The Probe method is mentioned in Table 4 and highlighted as the best-performing method. However, it is not introduced or discussed in Section 3, leading to some confusion when comparing results in Section 3.2.1. A clearer explanation of this method earlier in the paper would improve the comprehensibility of the findings. At a minimum, the scope of methods should be kept consistent when discussing the "best-performing" one.
2. As noted by the authors in their limitations section, the types of instructions include
- the paper introduces new benchmark datasets (Controlled and Realistic versions) that isolate factors influencing uncertainty estimation in instruction-following, filling an existing research gap.
- the methodologies employed are rigorous, with comprehensive experimental setups involving multiple LLMs and uncertainty estimation techniques.
- the writing is clear, and the results are presented in a way that is easy to follow, supported by well-designed figures and tables.
- the findings offer va
- the analysis might benefit from extending the scope of instruction types and domains included in the benchmark to cover more diverse real-world tasks
- while the paper identifies the use of internal states for uncertainty estimation as promising, it falls short in exploring more sophisticated methods that could leverage this information in nuanced tasks
- the use of GPT-4 for task quality assessment introduces a potential risk of pre-training overlap affecting the evaluation, though this is ac
The study provides crucial insights into LLMs' uncertainty estimation in instruction-following problems.
1. The paper relies on a single dataset, IFEval, for its initial evaluation. It would be beneficial to include additional datasets to validate the findings across different contexts and domains.
2. The paper raises many novel concepts and findings, but it does not seem to provide much direct help in enhancing instruction-following capabilities.
3. The paper did not test LLMs larger than 13B, so the conclusions may not generalize to larger models.
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification · Natural Language Processing Techniques
