Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs
Chongyu Wang, Ting Huang, Chunyu Sun, Xinyu Ning, Di Wang, Hao Tang

TL;DR
GUIDE is a multi-level geometric prior injection framework for multimodal LLMs that improves spatial awareness and reasoning by progressively integrating geometric features, from local to global.
Contribution
The paper proposes GUIDE, a progressive geometric prior injection method that samples multi-granularity features from the geometric encoder and aligns them with the early LLM layers to improve spatial reasoning.
Findings
GUIDE outperforms existing methods on spatial reasoning tasks.
Multi-level sampling captures detailed local and global geometric features.
Context-aware gating improves spatial cue utilization and noise suppression.
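The findings above can be illustrated with a minimal sketch of gated, layer-wise injection. This summary does not give GUIDE's actual formulation, so everything here is an assumption: the function names (gated_inject, progressive_inject), the weight shapes, and the use of a sigmoid gate over the concatenated hidden state and projected geometric feature are a generic gated residual fusion, not the authors' implementation. The LLM layers themselves are omitted; only the injection step between them is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_inject(hidden, geo, w_proj, w_gate):
    """Add a geometric feature to hidden states through a learned gate.

    hidden: (tokens, d_model) hidden states at one early LLM layer
    geo:    (tokens, d_geo) geometric features from one encoder level
    w_proj: (d_geo, d_model) projection into the LLM width (hypothetical)
    w_gate: (2 * d_model, d_model) gate weights (hypothetical)
    """
    geo_proj = geo @ w_proj                                  # (tokens, d_model)
    gate_in = np.concatenate([hidden, geo_proj], axis=-1)    # (tokens, 2 * d_model)
    gate = sigmoid(gate_in @ w_gate)                         # values in (0, 1)
    # The gate depends on the current context, so it can pass useful
    # spatial cues and scale down features that do not fit the context.
    return hidden + gate * geo_proj

def progressive_inject(hidden, geo_levels, params):
    """Inject one geometric level per early layer, shallow to deep.

    Assumption: level i of the geometric encoder is paired with early
    layer i of the LLM. The LLM layer computation is left out.
    """
    for geo, (w_proj, w_gate) in zip(geo_levels, params):
        hidden = gated_inject(hidden, geo, w_proj, w_gate)
    return hidden

# Toy dimensions chosen only for illustration.
rng = np.random.default_rng(0)
tokens, d_model, d_geo, n_levels = 4, 8, 6, 3
hidden = rng.normal(size=(tokens, d_model))
geo_levels = [rng.normal(size=(tokens, d_geo)) for _ in range(n_levels)]
params = [
    (0.1 * rng.normal(size=(d_geo, d_model)),
     0.1 * rng.normal(size=(2 * d_model, d_model)))
    for _ in range(n_levels)
]
out = progressive_inject(hidden, geo_levels, params)
```

In this sketch a zero geometric feature leaves the hidden state unchanged, since the gate multiplies the projected feature rather than the hidden state, so the injection acts as a residual correction on top of the language model's own representation.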
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in 2D visual tasks but still exhibit limited physical spatial awareness when processing real-world visual streams. Recently, feed-forward geometric foundation models, which implicitly extract geometric priors, have provided a new pathway to address this issue. However, existing geometry-aware MLLMs are predominantly constrained by the paradigm of single deep-layer extraction and input-level fusion. This flattened fusion leads to the loss of local geometric details and causes semantic mismatches in the early layers. To break this bottleneck, we propose GUIDE (Geometric Unrolling Inside MLLM Early-layers), a progressive geometric priors injection framework. GUIDE performs multi-level sampling within the geometric encoder, comprehensively capturing multi-granularity features ranging from local edges to global…
