TL;DR
BareBones is a new benchmark that rigorously tests whether vision-language models truly understand geometric shapes, revealing widespread reliance on textures and environmental cues.
Contribution
It introduces a pixel-level silhouette benchmark across multiple datasets, exposing the texture bias in current models and establishing a standard for geometric comprehension evaluation.
Findings
Models perform poorly without RGB textures, indicating a reliance on visual shortcuts.
The benchmark exposes structural blind spots shared across state-of-the-art VLMs.
Performance collapse under RGB deprivation is termed the 'Texture Bias Cliff'.
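To make the idea of a performance collapse concrete, here is a minimal sketch of one way such a gap could be measured: score the same model on RGB inputs and on silhouette inputs, then report the relative drop. The summary does not give the paper's formula, so `accuracy`, `relative_drop`, and the toy predictions below are illustrative assumptions, not the authors' metric or results.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the ground-truth labels."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def relative_drop(acc_rgb: float, acc_silhouette: float) -> float:
    """Share of RGB accuracy lost on silhouettes (assumed definition)."""
    if acc_rgb == 0:
        return 0.0  # nothing to lose if the RGB baseline is already zero
    return (acc_rgb - acc_silhouette) / acc_rgb


# Made-up toy data, for illustration only.
labels = ["cat", "dog", "bird", "fish"]
rgb_preds = ["cat", "dog", "bird", "fish"]         # predictions on RGB images
silhouette_preds = ["cat", "cat", "bird", "dog"]   # predictions on silhouettes

acc_rgb = accuracy(rgb_preds, labels)
acc_sil = accuracy(silhouette_preds, labels)
drop = relative_drop(acc_rgb, acc_sil)
```

A relative drop is used here instead of an absolute difference so that models with different RGB baselines can be compared on the same scale. That is a design choice of this sketch.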
Abstract
While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism: they conflate semantic reasoning with texture mapping and rely on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce BareBones, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free…
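As a rough illustration of what a pixel-level silhouette is, the sketch below turns a binary segmentation mask into a grayscale image in which the object is solid dark and everything else is uniform light, so that neither texture nor background context survives. The function name `mask_to_silhouette`, the pixel values, and the toy mask are assumptions for illustration. The benchmark's actual curation pipeline is not described in this summary.

```python
from typing import List

Mask = List[List[int]]  # 1 = object pixel, 0 = background pixel


def mask_to_silhouette(mask: Mask, fg: int = 0, bg: int = 255) -> List[List[int]]:
    """Render a binary mask as a grayscale silhouette.

    Object pixels become `fg` (black by default) and all other pixels
    become `bg` (white by default), discarding texture and context.
    """
    return [[fg if px else bg for px in row] for row in mask]


# Toy 3x3 mask of a plus-shaped object (hypothetical data).
mask = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
silhouette = mask_to_silhouette(mask)
```

In practice the mask would come from a dataset's segmentation annotations, and the resulting array would be saved as an image before being shown to the model.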
