Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance
Jason Qiu, Zachary Meurer, Xavier Thomas, Deepti Ghadiyaram

TL;DR
This paper reveals that current Vision-Language Models lack robust spatial invariance, leading to failures in recognizing objects under simple geometric transformations across various visual domains.
Contribution
It systematically evaluates VLMs' fragility to geometric transformations, highlighting a significant gap between semantic understanding and spatial reasoning.
Findings
VLM performance drops sharply under simple geometric transformations.
Failures are consistent across different architectures and prompting strategies.
Current VLMs lack robust spatial invariance and equivariance.
Abstract
This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: a lack of the robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial…
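To make the notion of invariance under rotation, scaling, and identity transformations concrete, here is a minimal Python sketch of a consistency check. It is not the paper's evaluation harness: the toy image, the helper names (`rotate90`, `upscale2x`, `invariance_report`), and the two stand-in predictors are all hypothetical, and a real study would replace the predictor with a call to a VLM.

```python
from typing import Callable, Dict, List

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def identity(img: Image) -> Image:
    """Return an unchanged copy of the image."""
    return [list(row) for row in img]


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate180(img: Image) -> Image:
    """Rotate the image 180 degrees."""
    return rotate90(rotate90(img))


def upscale2x(img: Image) -> Image:
    """Nearest-neighbour 2x upscaling."""
    out: Image = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


# Hypothetical transformation set, loosely mirroring the abstract's list.
TRANSFORMS: Dict[str, Callable[[Image], Image]] = {
    "identity": identity,
    "rotate90": rotate90,
    "rotate180": rotate180,
    "upscale2x": upscale2x,
}


def invariance_report(predict: Callable[[Image], str], img: Image) -> Dict[str, bool]:
    """For each transform, record whether the prediction matches the original."""
    base = predict(img)
    return {name: predict(t(img)) == base for name, t in TRANSFORMS.items()}


def top_heavy(img: Image) -> str:
    """Stand-in for an orientation-sensitive model (an assumption, not a VLM)."""
    half = len(img) // 2
    top = sum(sum(row) for row in img[:half])
    bottom = sum(sum(row) for row in img[half:])
    return "top" if top > bottom else "bottom"


def mean_bright(img: Image) -> str:
    """Stand-in for a predictor that is invariant to these transforms."""
    pixels = [p for row in img for p in row]
    return "bright" if sum(pixels) / len(pixels) > 4 else "dark"


toy_image: Image = [[9, 9], [0, 0]]
print(invariance_report(top_heavy, toy_image))
print(invariance_report(mean_bright, toy_image))
```

The report for `top_heavy` flags the rotations as inconsistent, while `mean_bright` agrees with itself under every transform. The fraction of `True` entries is one simple way to score invariance, chosen here for illustration only.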
