GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning
Ruiheng Liu, Haihong Hao, Mingfei Han, Xin Gu, Kecheng Zhang, Changlin Li, Xiaojun Chang

TL;DR
GeoSense enhances multimodal reasoning by enabling models to autonomously recognize when geometric information is necessary, improving spatial understanding without injecting geometric signals into every input or adding unnecessary computation.
Contribution
The paper introduces a framework that gives multimodal models an awareness of perceptual insufficiency, so that they engage geometric features autonomously when 2D cues are insufficient.
Findings
Significant improvements on spatial reasoning benchmarks.
Models maintain 2D visual reasoning capabilities.
Efficient utilization of geometric features without added overhead.
Abstract
Advancing towards artificial superintelligence requires rich and intelligent perceptual capabilities. A critical frontier in this pursuit is overcoming the limited spatial understanding of Multimodal Large Language Models (MLLMs), where geometry information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, while ignoring their necessity and adding computation overhead. Contrary to this paradigm, our framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient. To achieve this, we first introduce an independent geometry input channel to the model architecture and conduct alignment training, enabling the effective utilization of geometric features. Subsequently, to endow the model with perceptual awareness, we…
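The conditional use of geometry described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated before it explains how the model decides, so the scalar `insufficiency_score`, the `threshold`, and the names `build_inputs` and `toy_geometry_encoder` are all hypothetical stand-ins. The sketch only shows the general idea that geometric features enter the input through a separate channel and are computed only when 2D cues are judged insufficient.

```python
from typing import Callable, List, Sequence

Token = List[float]


def build_inputs(
    visual_tokens: Sequence[Token],
    geometry_encoder: Callable[[], Sequence[Token]],
    insufficiency_score: float,
    threshold: float = 0.5,
) -> List[Token]:
    """Return the token sequence passed on for reasoning.

    Hypothetical gate: the geometry encoder is called lazily, so inputs
    judged answerable from 2D cues alone never pay for geometric
    feature extraction.
    """
    tokens = [list(t) for t in visual_tokens]
    if insufficiency_score > threshold:
        # Separate geometry channel: features are appended to the 2D
        # tokens rather than replacing them.
        tokens.extend(list(t) for t in geometry_encoder())
    return tokens


# Toy encoder that counts how often it is invoked (illustration only).
calls = {"geometry": 0}


def toy_geometry_encoder() -> List[Token]:
    calls["geometry"] += 1
    return [[0.1, 0.2], [0.3, 0.4]]


visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
easy = build_inputs(visual, toy_geometry_encoder, insufficiency_score=0.2)
hard = build_inputs(visual, toy_geometry_encoder, insufficiency_score=0.9)
print(len(easy), len(hard), calls["geometry"])  # prints "3 5 1"
```

In this toy run the encoder is invoked once across the two inputs, which is the kind of saving the abstract contrasts with injecting geometric signals into every input. In the paper the decision is described as an awareness the model itself acquires, not an external threshold.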
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation
