Contextual inference from single objects in Vision-Language models

Martina G. Vilas; Timothy Schaumlöffel; Gemma Roig

arXiv:2603.26731·cs.CV·March 31, 2026

TL;DR

This study examines how vision-language models (VLMs) infer scene context from single objects shown on masked backgrounds. Single objects support above-chance inference of both fine-grained scene category and superordinate context (indoor vs. outdoor), and the study analyzes the mechanisms behind these predictions.

Contribution

It provides a systematic analysis of contextual inference in VLMs, highlighting differences from human perception and uncovering the mechanistic basis of scene understanding.

Findings

01 Single objects enable above-chance scene inference in VLMs.

02 The object properties that predict human scene categorization also modulate VLM performance.

03 Scene and superordinate information are encoded differently within models.

Abstract

How much scene context a single object carries is a well-studied question in human scene perception, yet how this capacity is organized in vision-language models (VLMs) remains poorly understood, with direct implications for the robustness of these models. We investigate this question through a systematic behavioral and mechanistic analysis of contextual inference from single objects. Presenting VLMs with single objects on masked backgrounds, we probe their ability to infer both fine-grained scene category and coarse superordinate context (indoor vs. outdoor). We find that single objects support above-chance inference at both levels, with performance modulated by the same object properties that predict human scene categorization. Object identity, scene, and superordinate predictions are partially dissociable: accurate inference at one level neither requires nor guarantees accurate…
