GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning
Zhaochen Liu, Limeng Qiao, Guanglu Wan, Tingting Jiang

TL;DR
GeoAlign introduces a dynamic multi-layer geometric feature aggregation framework that enhances spatial reasoning in multimodal large language models, achieving state-of-the-art results with a compact model.
Contribution
It proposes a novel hierarchical feature bank and layer-wise sparse routing to better align geometric features with spatial reasoning demands.
Findings
Outperforms larger MLLMs on VSI-Bench, ScanQA, and SQA3D datasets.
Achieves state-of-the-art performance with a 4B-parameter model.
Effectively improves spatial reasoning in MLLMs.
Abstract
Multimodal large language models (MLLMs) have exhibited remarkable performance in various visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but rely on static single-layer extractions. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign with the actual demands. GeoAlign constructs a hierarchical geometric feature bank and leverages the MLLM's original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively fetching the suitable…
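The routing idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dot-product scoring, the top-k selection rule, the softmax over the kept layers, and every name here (`sparse_layer_routing`, `w_q`, `w_k`, `top_k`) are assumptions made for illustration. The sketch treats the feature bank as a stack of per-layer geometric features, uses each visual token as a query, scores every layer for that token, keeps only the highest-scoring layers, and returns their weighted sum.

```python
import numpy as np


def sparse_layer_routing(visual_tokens, feature_bank, w_q, w_k, top_k=2):
    """Hypothetical per-token sparse routing over a multi-layer feature bank.

    visual_tokens: (N, d_v) visual tokens from the MLLM, used as queries.
    feature_bank:  (L, N, d_g) geometric features from L layers of a 3D model.
    w_q:           (d_v, d_r) query projection (assumed; learned in practice).
    w_k:           (d_g, d_r) key projection (assumed; learned in practice).
    top_k:         number of layers kept per token (assumed; needs top_k <= L).

    Returns the fused features (N, d_g) and the routing weights (N, L).
    """
    queries = visual_tokens @ w_q                # (N, d_r)
    keys = feature_bank @ w_k                    # (L, N, d_r)

    # Score each layer for each token with a scaled dot product.
    scores = np.einsum("nd,lnd->nl", queries, keys) / np.sqrt(w_q.shape[1])

    # Keep the top_k highest-scoring layers per token and mask out the rest.
    top_idx = np.argpartition(-scores, top_k - 1, axis=1)[:, :top_k]
    keep = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(keep, top_idx, True, axis=1)
    masked = np.where(keep, scores, -np.inf)

    # Softmax over the kept layers only; masked layers get exactly zero weight.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted sum of the per-layer geometric features for each token.
    fused = np.einsum("nl,lnd->nd", weights, feature_bank)
    return fused, weights
```

Under these assumptions, each token can draw on different layers of the 3D model, in contrast to the static single-layer extraction the abstract criticizes. How the paper actually scores, selects, and fuses layers is not specified in the truncated abstract above.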
