Common Objects Out of Context (COOCo): Investigating Multimodal Context and Semantic Scene Violations in Referential Communication

Filippo Merlo; Ece Takmaz; Wenkai Chen; Albert Gatt

arXiv:2506.22274·cs.CV·February 11, 2026



TL;DR

This paper introduces the COOCo dataset to study how vision-language models (VLMs) use scene context when referring to objects, including objects that violate the semantics of their scene. It finds that reliance on context is adaptive and that attention patterns vary with target-scene semantic fit.

Contribution

The paper presents the COOCo dataset and analyzes VLMs' reliance on scene context and attention dynamics in object referencing tasks.

Findings

01

Models use scene context adaptively, depending on scene-object semantic relatedness and noise level.

02

Successful categorization correlates with increased mid-layer attention to the target.

03

Attention patterns vary non-monotonically with semantic fit, dropping at moderate fit and rising at low and high fit.
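
The findings above refer to "attention to the target" per layer. This page does not say how the paper measures it, so the following is only a minimal sketch of one plausible measure: the share of attention mass that falls on target-region tokens, averaged over heads within each layer. The function name `target_attention_share`, the nested-list layout, and the toy values are all assumptions made for illustration and are not taken from the paper.

```python
def target_attention_share(attn, target_mask):
    """Per-layer share of attention mass that lands on target tokens.

    attn: nested list [layer][head][token] of non-negative attention
          weights over image tokens (hypothetical layout).
    target_mask: list of bools, True where a token lies inside the
          target object's region (hypothetical annotation).
    """
    shares = []
    for layer in attn:
        head_shares = []
        for head in layer:
            total = sum(head)
            on_target = sum(w for w, m in zip(head, target_mask) if m)
            # Normalise per head so heads with different total mass are comparable.
            head_shares.append(on_target / total if total > 0 else 0.0)
        # Average over heads to get one value per layer.
        shares.append(sum(head_shares) / len(head_shares))
    return shares


# Toy input: 2 layers, 2 heads, 4 image tokens; the first two tokens are the target.
attn = [
    [[0.25, 0.25, 0.25, 0.25], [0.1, 0.2, 0.3, 0.4]],
    [[0.7, 0.1, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]],
]
mask = [True, True, False, False]
shares = target_attention_share(attn, mask)
print(shares)
```

With real model outputs, a curve of these per-layer values could be compared across correct and incorrect categorisations, or across levels of semantic fit, which is the kind of comparison the findings describe.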

Abstract

To what degree and under what conditions do VLMs rely on scene context when generating references to objects? To address this question, we introduce the Common Objects Out-of-Context (COOCo) dataset and conduct experiments on several VLMs under different degrees of scene-object congruency and noise. We find that models leverage scene context adaptively, depending on scene-object semantic relatedness and noise level. Based on these consistent trends across models, we turn to the question of how VLM attention patterns change as a function of target-scene semantic fit, and to what degree these patterns are predictive of categorisation accuracy. We find that successful object categorisation is associated with increased mid-layer attention to the target. We also find a non-monotonic dependency on semantic fit, with attention dropping at moderate fit and increasing for both low and…


Code & Models

Datasets

fmerlo/COOCO
dataset · 21 dl


Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems