TL;DR
GeoWeaver is a pre-reasoning geometric grounding framework that adaptively allocates geometric evidence to individual visual tokens, improving spatial reasoning in vision-language models.
Contribution
A token-adaptive geometric evidence allocation method: rather than sharing one geometric signal across all visual tokens, each token is grounded with the geometric abstractions relevant to its spatial role.
Findings
Consistently improves performance on spatial reasoning benchmarks.
Retains general multimodal capabilities.
Highlights geometric information as a fundamental reasoning prerequisite.
Abstract
Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder…
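The abstract is cut off before it describes how tokens draw on the geometry bank, so the following is only a minimal sketch of what token-adaptive allocation could look like, not GeoWeaver's actual mechanism. It assumes a hypothetical bank of K geometry levels already aligned with the N visual tokens, scores each level per token with a scaled dot product, and adds the softmax-weighted evidence back as a residual. The function name `allocate_geometry`, the scoring rule, and the residual fusion are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def allocate_geometry(visual_tokens, geometry_bank):
    """Hypothetical token-adaptive allocation (an assumption, not the paper's method).

    visual_tokens: list of N vectors, each of length D.
    geometry_bank: list of K levels; each level is a list of N vectors of
                   length D, aligned position-by-position with the tokens.
    Returns (grounded_tokens, weights), where weights[n][k] is the share of
    level k used by token n.
    """
    dim = len(visual_tokens[0])
    scale = math.sqrt(dim)
    grounded, all_weights = [], []
    for n, token in enumerate(visual_tokens):
        # Each token scores every geometry level at its own position.
        scores = [dot(token, level[n]) / scale for level in geometry_bank]
        weights = softmax(scores)
        # Weighted mix of the levels: this token's geometric evidence.
        evidence = [
            sum(w * level[n][d] for w, level in zip(weights, geometry_bank))
            for d in range(dim)
        ]
        # Residual fusion keeps the original visual content intact.
        grounded.append([t + e for t, e in zip(token, evidence)])
        all_weights.append(weights)
    return grounded, all_weights


# Toy example: two tokens, two geometry levels with different orientations.
tokens = [[1.0, 0.0], [0.0, 1.0]]
bank = [
    [[1.0, 0.0], [1.0, 0.0]],  # level 0
    [[0.0, 1.0], [0.0, 1.0]],  # level 1
]
grounded, weights = allocate_geometry(tokens, bank)
for n, w in enumerate(weights):
    print(f"token {n} weights over levels: {w}")
```

In this toy setup the first token puts more weight on level 0 and the second on level 1, which is the contrast with a shared signal: a shared scheme would give every token the same mixture regardless of its spatial role.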
