When Contextual Inference Fails: Cancelability in Interactive Instruction Following
Natalia Bila, Kata Naszádi, Alexandra Mayn, Christof Monz

TL;DR
This paper examines how large language models handle contextual inference and clarification requests in a collaborative block-building task. The models recognize speaker unreliability in explicit confidence ratings but often fail to use that information to decide when to ask for clarification.
Contribution
It introduces Build What I Mean (BWIM), an interactive benchmark that tests whether models resolve ambiguous instructions by contextual inference or by requesting clarification at a small communication cost.
Findings
Models detect speaker unreliability in confidence ratings.
Models often fail to use the detected unreliability to guide efficient clarification requests.
Models exhibit suboptimal clarification strategies.
Abstract
We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve underspecified instructions using contextual inferences. Building on an existing two-speaker psycholinguistic paradigm -- which contrasts a pragmatically cooperative speaker with one who is only literally reliable -- we introduce Build What I Mean (BWIM), an interactive benchmark for contextual meaning construction. In BWIM, models must resolve ambiguity by either performing a contextual inference or requesting clarification at a small communication cost. Evaluating several state-of-the-art LLMs, we find a dissociation between judgment and action: while models detect speaker unreliability in explicit confidence ratings, they fail to exploit this information to guide efficient clarification behavior. Instead, we observe suboptimal…
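The trade-off the abstract describes, inferring the intended meaning versus paying a small cost to ask, can be illustrated with a simple expected-cost rule. The sketch below is not the paper's method or scoring scheme: the function name `should_ask`, its parameters, and all numeric values are hypothetical, chosen only to show why an efficient builder would ask more often when the speaker is only literally reliable.

```python
def should_ask(p_inference_correct: float, error_cost: float, ask_cost: float) -> bool:
    """Decide whether to request clarification (hypothetical decision rule).

    p_inference_correct: builder's estimated probability that the contextual
        inference matches what the speaker meant (assumed input).
    error_cost: penalty for building the wrong structure (assumed value).
    ask_cost: communication cost of one clarification request (assumed value).
    """
    # Expected penalty of acting on the inference without asking.
    expected_guess_cost = (1.0 - p_inference_correct) * error_cost
    # Ask only when asking is cheaper than the expected penalty of guessing.
    return ask_cost < expected_guess_cost


# Illustrative numbers only: a cooperative speaker makes the inference
# trustworthy, while a literal-only speaker makes it close to a coin flip.
cooperative = should_ask(p_inference_correct=0.95, error_cost=1.0, ask_cost=0.2)
literal_only = should_ask(p_inference_correct=0.50, error_cost=1.0, ask_cost=0.2)
print(cooperative, literal_only)
```

Under these assumed numbers the rule skips the question for the cooperative speaker and asks for the literal-only one. The dissociation reported in the abstract corresponds to a builder whose confidence estimate drops for the unreliable speaker while its asking behavior does not change accordingly.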
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Neurobiology of Language and Bilingualism
