Do LLMs estimate uncertainty well in instruction-following?

Juyeon Heo; Miao Xiong; Christina Heinze-Deml; Jaya Narain

arXiv:2410.14582·cs.AI·March 31, 2025

TL;DR

This paper systematically evaluates how well large language models estimate their uncertainty when following instructions, revealing significant limitations and the need for improved methods for trustworthy AI deployment.

Contribution

It introduces a controlled evaluation framework for uncertainty estimation in instruction-following LLMs and provides the first comprehensive analysis of their capabilities and shortcomings.

Findings

1. Existing uncertainty methods struggle with subtle errors.
2. Internal model states offer limited improvements.
3. Evaluation setup reveals key limitations in current approaches.

Abstract

Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs' instruction-following capabilities, raising concerns about their reliability in high-stakes applications. Accurately estimating LLMs' uncertainty in adhering to instructions is critical to mitigating deployment risks. We present, to our knowledge, the first systematic evaluation of the uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies key challenges with existing instruction-following benchmarks, where multiple factors are entangled with the uncertainty that stems from instruction-following, complicating isolation and comparison across methods and models. To address these issues, we introduce a controlled evaluation setup with two…

Peer Reviews

Decision: ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. The authors provide a systematic evaluation, identifying multiple factors that affect uncertainty estimation and are entangled within naturally generated responses, which helps isolate the influence of various factors and allows for a more accurate assessment of estimation methods' capabilities.
2. The study offers thorough comparisons between different uncertainty estimation methods, various LLMs, and distinct datasets.

Weaknesses

1. The Probe method is mentioned in Table 4 and highlighted as the best-performing method. However, it is not introduced or discussed in Section 3, leading to some confusion when comparing results in Section 3.2.1. A clearer explanation of this method earlier in the paper would improve the comprehensibility of the findings. At a minimum, the scope of methods should be kept consistent when discussing the "best-performing" one.
2. As noted by the authors in their limitations section, the types of instructions include

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- the paper introduces new benchmark datasets (Controlled and Realistic versions) that isolate factors influencing uncertainty estimation in instruction-following, filling an existing research gap.
- the methodologies employed are rigorous, with comprehensive experimental setups involving multiple LLMs and uncertainty estimation techniques.
- the writing is clear, and the results are presented in a way that is easy to follow, supported by well-designed figures and tables.
- the findings offer va

Weaknesses

- the analysis might benefit from extending the scope of instruction types and domains included in the benchmark to cover more diverse real-world tasks
- while the paper identifies the use of internal states for uncertainty estimation as promising, it falls short in exploring more sophisticated methods that could leverage this information in nuanced tasks
- the use of GPT-4 for task quality assessment introduces a potential risk of pre-training overlap affecting the evaluation, though this is ac

Reviewer 03 · Rating 1 · Confidence 3

Strengths

The study provides crucial insights into LLMs' uncertainty estimation in instruction-following problems.

Weaknesses

1. The paper relies on a single dataset, IFEval, for its initial evaluation. It would be beneficial to include additional datasets to validate the findings across different contexts and domains.
2. The paper raises many novel concepts and findings, but it does not seem to provide much direct help in enhancing instruction-following capabilities.
3. The paper did not test LLMs larger than 13B, so the evidence may be insufficient to support its conclusions.

Code & Models

Repositories

apple/ml-uncertainty-llms-instruction-following
PyTorch · Official

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification · Natural Language Processing Techniques