A Framework for Evaluating LLMs Under Task Indeterminacy
Luke Guerdan, Hanna Wallach, Solon Barocas, Alexandra Chouldechova

TL;DR
This paper introduces a framework for evaluating large language models in scenarios where tasks are ambiguous or vague, addressing the limitations of traditional single-answer evaluation methods.
Contribution
The paper develops a framework that accounts for task indeterminacy, disentangles the relationships between task specification, human ratings, and LLM responses, and provides methods for estimating performance intervals when items are ambiguous or vague.
Findings
Evaluations assuming a single gold label underestimate true performance.
The framework can estimate error-adjusted performance intervals.
Synthetic experiments demonstrate the importance of accounting for indeterminacy.
Abstract
Large language model (LLM) evaluations often assume there is a single correct response -- a gold label -- for each item in the evaluation corpus. However, some tasks can be ambiguous -- i.e., they provide insufficient information to identify a unique interpretation -- or vague -- i.e., they do not clearly indicate where to draw the line when making a determination. Both ambiguity and vagueness can cause task indeterminacy -- the condition where some items in the evaluation corpus have more than one correct response. In this paper, we develop a framework for evaluating LLMs under task indeterminacy. Our framework disentangles the relationships between task specification, human ratings, and LLM responses in the LLM evaluation pipeline. Using our framework, we conduct a synthetic experiment showing that evaluations that use the "gold label" assumption underestimate the true performance. We…
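To make the gold-label issue concrete, here is a minimal sketch in Python. It is not the authors' method or code: the item IDs, labels, and function names are hypothetical, and it assumes the full set of correct responses for each item is known. It compares accuracy scored against a single gold label with accuracy scored against the set of correct responses.

```python
from typing import Dict, Set


def gold_label_accuracy(responses: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of items where the response equals the single gold label."""
    hits = sum(responses[item] == gold[item] for item in responses)
    return hits / len(responses)


def set_based_accuracy(responses: Dict[str, str], correct: Dict[str, Set[str]]) -> float:
    """Fraction of items where the response is in the set of correct responses."""
    hits = sum(responses[item] in correct[item] for item in responses)
    return hits / len(responses)


# Hypothetical toy corpus: items q2 and q3 are indeterminate (two correct responses).
correct_sets = {
    "q1": {"yes"},
    "q2": {"yes", "no"},
    "q3": {"toxic", "not toxic"},
    "q4": {"no"},
}

# A single gold label per item, chosen from each item's correct set.
gold = {"q1": "yes", "q2": "yes", "q3": "toxic", "q4": "no"}

# Hypothetical model responses.
responses = {"q1": "yes", "q2": "no", "q3": "not toxic", "q4": "yes"}

gold_acc = gold_label_accuracy(responses, gold)        # only q1 matches its gold label
set_acc = set_based_accuracy(responses, correct_sets)  # q1, q2, q3 fall in the correct sets

print(f"gold-label accuracy: {gold_acc:.2f}")
print(f"set-based accuracy:  {set_acc:.2f}")
```

In this toy setup, whenever each gold label is itself one of the correct responses, the gold-label score can never exceed the set-based score, which is the direction of underestimation the abstract describes.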
Taxonomy
Topics: Scheduling and Optimization Algorithms · Business Process Modeling and Analysis · Service-Oriented Architecture and Web Services
