Can Vision-Language Models Infer Speaker's Ignorance? The Role of Visual and Linguistic Cues

Ye-eun Cho; Yunho Maeng

arXiv:2502.09120·cs.CL·May 20, 2025

TL;DR

This paper examines whether vision-language models can infer a speaker's ignorance from visual and linguistic cues, and finds that Claude, unlike GPT and Gemini, shows emerging pragmatic reasoning abilities.

Contribution

The paper shows how different VLMs use contextual cues in pragmatic inference and highlights Claude's ability to integrate visual and linguistic cues in a more human-like way.

Findings

01. Claude integrates visual and linguistic cues more effectively than GPT and Gemini.

02. GPT and Gemini favor precise, literal interpretations rather than combining the cues.

03. GPT and Gemini treat each contextual cue independently, showing limited pragmatic reasoning.

Abstract

This study investigates whether vision-language models (VLMs) can perform pragmatic inference, focusing on ignorance implicatures, utterances that imply the speaker's lack of precise knowledge. To test this, we systematically manipulated contextual cues: the visually depicted situation (visual cue) and QUD-based linguistic prompts (linguistic cue). When only visual cues were provided, three state-of-the-art VLMs (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet) produced interpretations largely based on the lexical meaning of the modified numerals. When linguistic cues were added to enhance contextual informativeness, Claude exhibited more human-like inference by integrating both types of contextual cues. In contrast, GPT and Gemini favored precise, literal interpretations. Although the influence of contextual cues increased, they treated each contextual cue independently and aligned them…

Taxonomy

Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Softmax · Cosine Annealing · Attention Dropout · Residual Connection · Linear Layer · Byte Pair Encoding · Weight Decay · Dropout