Self-Consistency of Large Language Models under Ambiguity
Henning Bartsch, Ole Jorgensen, Domenic Rosati, Jason Hoelscher-Obermaier, Jacob Pfau

TL;DR
This paper evaluates the self-consistency of large language models on ambiguous tasks, finding that self-consistency improves with capability, that models nonetheless tend to keep multiple possible answers in play internally, and that calibration issues remain.
Contribution
The authors introduce a benchmark for assessing self-consistency in LLMs under ambiguity and analyze models' behavior, robustness, and internal probability distributions.
Findings
Average consistency ranges from 67% to 82%, far higher than would be expected if consistency were random.
Self-consistency increases with model capability.
Models often assign significant probability to alternative answers.
Abstract
Large language models (LLMs) that do not give consistent answers across contexts are problematic when used for tasks with expectations of consistency, e.g., question-answering, explanations, etc. Our work presents an evaluation benchmark for self-consistency in cases of under-specification where two or more answers can be correct. We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task. We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model's consistency was random, and increases as model capability improves. Furthermore, we show that models tend to maintain self-consistency across a series of robustness checks, including prompting speaker changes and sequence length changes. These results suggest that self-consistency arises as an emergent capability without…
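To make the idea of self-consistency on an ambiguous integer sequence concrete, here is a minimal Python sketch. It is not the authors' benchmark code: the two candidate rules, the function names, and the toy records are all assumptions chosen for illustration. It treats a response as self-consistent when the next term a model gives matches the rule the same model says it is following.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical candidate rules (not taken from the paper). Both reproduce
# the prefix 1, 2, 4 but disagree on the next term, so the task is ambiguous.
RULES: Dict[str, Callable[[int], int]] = {
    "powers_of_two": lambda i: 2 ** i,               # 1, 2, 4, 8, ...
    "lazy_caterer": lambda i: i * (i + 1) // 2 + 1,  # 1, 2, 4, 7, ...
}


def consistent_rules(prefix: List[int]) -> List[str]:
    """Names of all candidate rules that reproduce every term of the prefix."""
    return [
        name
        for name, rule in RULES.items()
        if all(rule(i) == term for i, term in enumerate(prefix))
    ]


def is_self_consistent(prefix: List[int], stated_rule: str, completion: int) -> bool:
    """True when the completion equals the next term of the stated rule."""
    return RULES[stated_rule](len(prefix)) == completion


def consistency_rate(records: List[Tuple[List[int], str, int]]) -> float:
    """Fraction of (prefix, stated_rule, completion) records that agree."""
    hits = sum(is_self_consistent(p, r, c) for p, r, c in records)
    return hits / len(records)


# Toy records invented for illustration; these are not model outputs.
toy_records = [
    ([1, 2, 4], "powers_of_two", 8),  # explanation and completion agree
    ([1, 2, 4], "lazy_caterer", 7),   # explanation and completion agree
    ([1, 2, 4], "lazy_caterer", 8),   # explanation and completion disagree
]
print(consistent_rules([1, 2, 4]))
print(consistency_rate(toy_records))
```

In a real evaluation the stated rule and the completion would come from separate queries to the same model, and the set of candidate rules would be much larger than the two used here.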
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
