TL;DR
GeoThinker introduces an active perception framework for spatial reasoning in multimodal models, enabling selective geometric evidence retrieval to improve spatial understanding and generalization.
Contribution
The paper proposes GeoThinker, an active perception approach in which the model selectively retrieves geometric evidence conditioned on its internal reasoning demands, instead of fusing geometry features indiscriminately.
Findings
Achieves a new state-of-the-art score of 72.6 on VSI-Bench.
Demonstrates improved spatial perception in embodied referring and autonomous driving.
Shows robust generalization across complex downstream scenarios.
Abstract
Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward…
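The frame-strict cross-attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, tensor shapes, the residual connection, and the omission of learned query/key/value projections are all assumptions made for brevity. The abstract is truncated before it says what Importance Gating biases attention toward, so the gate below is a purely hypothetical per-token logit bias (`gate_w`), included only to show where such a term could enter.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frame_strict_cross_attention(sem, geo, gate_w):
    """Illustrative sketch only (assumed shapes and names).

    sem:    (F, N, D) semantic visual tokens per frame, used as queries
    geo:    (F, M, D) geometry tokens per frame, used as keys and values
    gate_w: (D,) hypothetical gating weights giving one logit bias per
            geometry token; the real gating rule is not stated in the excerpt
    """
    F, N, D = sem.shape
    # Frame-strict: the shared index f means queries in frame f only see
    # geometry tokens from the same frame f.
    scores = np.einsum("fnd,fmd->fnm", sem, geo) / np.sqrt(D)
    # Hypothetical importance gate: additive bias on the attention logits.
    bias = geo @ gate_w                      # (F, M)
    attn = softmax(scores + bias[:, None, :], axis=-1)
    # Retrieve geometry evidence and add it to the semantic stream
    # (the residual form is an assumption).
    retrieved = np.einsum("fnm,fmd->fnd", attn, geo)
    return sem + retrieved, attn

# Usage with random stand-in features: 2 frames, 3 query tokens,
# 5 geometry tokens, feature width 4.
rng = np.random.default_rng(0)
sem = rng.normal(size=(2, 3, 4))
geo = rng.normal(size=(2, 5, 4))
gate_w = rng.normal(size=4)
fused, attn = frame_strict_cross_attention(sem, geo, gate_w)
```

Because attention is computed independently per frame, changing the geometry tokens of one frame leaves the fused output of every other frame unchanged, which is the property the term "frame-strict" suggests.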
