Contextual inference from single objects in Vision-Language models
Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig

TL;DR
This study examines how vision-language models infer scene context from a single object, characterizing what they can and cannot infer about fine-grained scene category and superordinate context (indoor vs. outdoor), and the mechanisms underlying that inference.
Contribution
It provides a systematic analysis of contextual inference in VLMs, highlighting differences from human perception and uncovering the mechanistic basis of scene understanding.
Findings
Single objects enable above-chance scene inference in VLMs.
The object properties that predict human scene categorization also modulate VLM performance.
Scene and superordinate information are encoded differently within models.
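As a rough illustration of what "above-chance" means in the first finding, the sketch below runs a one-sided exact binomial test of classification accuracy against a uniform chance level. It is not the authors' analysis: the function names, the balanced-class assumption (chance = 1 / number of classes), the 0.05 threshold, and the example counts are all hypothetical.

```python
from math import comb


def binomial_p_value(n_correct: int, n_trials: int, chance: float) -> float:
    """One-sided P(X >= n_correct) for X ~ Binomial(n_trials, chance)."""
    return sum(
        comb(n_trials, k) * chance**k * (1 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


def above_chance(n_correct: int, n_trials: int, n_classes: int, alpha: float = 0.05):
    """Compare observed accuracy with uniform chance (assumes balanced classes)."""
    chance = 1.0 / n_classes
    accuracy = n_correct / n_trials
    p_value = binomial_p_value(n_correct, n_trials, chance)
    is_above = accuracy > chance and p_value < alpha
    return accuracy, chance, p_value, is_above


if __name__ == "__main__":
    # Hypothetical counts for a two-way indoor vs. outdoor decision.
    accuracy, chance, p_value, is_above = above_chance(60, 100, n_classes=2)
    print(f"accuracy={accuracy:.2f} chance={chance:.2f} p={p_value:.4f} above={is_above}")
```

For the fine-grained scene task the same check applies with `n_classes` set to the number of scene categories, which lowers the chance level accordingly.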
Abstract
How much scene context a single object carries is a well-studied question in human scene perception, yet how this capacity is organized in vision-language models (VLMs) remains poorly understood, with direct implications for the robustness of these models. We investigate this question through a systematic behavioral and mechanistic analysis of contextual inference from single objects. Presenting VLMs with single objects on masked backgrounds, we probe their ability to infer both fine-grained scene category and coarse superordinate context (indoor vs. outdoor). We found that single objects support above-chance inference at both levels, with performance modulated by the same object properties that predict human scene categorization. Object identity, scene, and superordinate predictions are partially dissociable: accurate inference at one level neither requires nor guarantees accurate…
