TL;DR
GeoSR improves the spatial reasoning of vision-language models by masking portions of 2D vision tokens during training and adaptively emphasizing geometry tokens, leading to state-of-the-art performance on static and dynamic spatial reasoning benchmarks.
Contribution
The paper introduces GeoSR, a novel framework that actively encourages VLMs to utilize geometry tokens through masking and fusion mechanisms, improving spatial reasoning capabilities.
Findings
GeoSR outperforms prior methods on static and dynamic spatial reasoning benchmarks.
Masking 2D visual tokens encourages reliance on geometric cues.
Adaptive fusion amplifies geometric evidence where it is most needed.
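The two mechanisms named in these findings can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the summary does not say how tokens are selected for masking or what form the fusion takes, so uniform random masking and a per-token sigmoid gate are used here as stand-ins. The names `mask_vision_tokens`, `gated_fusion`, `mask_ratio`, and `gate_weights` are hypothetical.

```python
import numpy as np


def mask_vision_tokens(vision_tokens, mask_ratio, rng):
    """Zero out a random subset of 2D vision tokens.

    Assumption: uniform random selection; the paper's "strategic"
    selection rule is not described in this summary.
    """
    num_tokens = vision_tokens.shape[0]
    num_masked = int(round(mask_ratio * num_tokens))
    masked_idx = rng.choice(num_tokens, size=num_masked, replace=False)
    keep = np.ones(num_tokens, dtype=bool)
    keep[masked_idx] = False
    # Broadcasting the boolean mask zeroes every masked token row.
    return vision_tokens * keep[:, None], keep


def gated_fusion(vision_tokens, geometry_tokens, gate_weights, gate_bias=0.0):
    """Add geometry tokens to vision tokens, scaled by a per-token gate.

    Assumption: a sigmoid gate computed from the concatenated features;
    this is a generic gated-fusion pattern, not the paper's module.
    """
    concat = np.concatenate([vision_tokens, geometry_tokens], axis=-1)  # (N, 2D)
    logits = concat @ gate_weights + gate_bias                           # (N,)
    gate = 1.0 / (1.0 + np.exp(-logits))                                 # values in (0, 1)
    fused = vision_tokens + gate[:, None] * geometry_tokens
    return fused, gate


# Toy data: 8 tokens with 4 features each.
rng = np.random.default_rng(0)
vision = rng.normal(size=(8, 4))
geometry = rng.normal(size=(8, 4))

masked_vision, keep = mask_vision_tokens(vision, mask_ratio=0.5, rng=rng)
fused, gate = gated_fusion(masked_vision, geometry, rng.normal(size=8))
```

In this toy setup, a masked position carries no 2D signal, so its fused token consists only of gated geometry. That is the intuition behind masking as a way to weaken 2D shortcuts during training.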
Abstract
Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to…
