Context informs pragmatic interpretation in vision-language models
Alvin Wei Ming Tan, Ben Prystawski, Veronica Boyce, Michael C. Frank

TL;DR
This paper tests humans and vision-language models on trials from iterated reference games, varying the amount, order, and relevance of the context provided. Without relevant context, models beat chance but fall well short of humans; with relevant context, model performance rises sharply over trials. Few-shot reference games with abstract referents remain difficult for models.
Contribution
The study shows that relevant context substantially improves vision-language models' pragmatic reasoning in multi-turn reference tasks, and that few-shot reference games with abstract referents remain difficult for current models.
Findings
Models outperform chance even without relevant context, but remain substantially worse than humans
Model performance increases dramatically over trials when relevant context is given
Abstract referents remain challenging for models
Abstract
Iterated reference games - in which players repeatedly pick out novel referents using language - present a test case for agents' ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. We tested humans and vision-language models on trials from iterated reference games, varying the given context in terms of amount, order, and relevance. Without relevant context, models were above chance but substantially worse than humans. However, with relevant context, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain a difficult task for machine learning models.
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Speech and dialogue systems
