GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models

Jiaxin Zhang; Junjun Jiang; Haijie Li; Youyu Chen; Kui Jiang; Dave Zhenyu Chen

arXiv:2603.16461 · cs.CV · March 18, 2026

Open Access

TL;DR

GAP-MLLM introduces a geometry-aligned pre-training approach that explicitly activates 3D structural perception in multimodal large language models, significantly improving their performance on 3D spatial tasks.

Contribution

The paper proposes a novel pre-training paradigm with geometry-specific supervision and a multi-level fusion module to better utilize 3D structural information in MLLMs.

Findings

01. Enhanced performance in 3D visual grounding
02. Improved 3D dense captioning accuracy
03. Superior results in 3D video object detection

Abstract

Multimodal Large Language Models (MLLMs) demonstrate exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Despite leveraging implicit geometric priors from 3D reconstruction models, image-based methods still exhibit a notable performance gap compared to methods using explicit 3D data. We argue that this gap does not arise from insufficient geometric priors, but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, leading to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before…
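For readers unfamiliar with the baseline the abstract criticizes, the following is a minimal sketch of what "naive feature concatenation" can look like: each visual token's feature vector simply has a geometry feature vector appended to it, with no geometry-specific supervision. This is not GAP-MLLM's fusion module, whose design is not described in the text above. The function name `concat_features`, the token shapes, and the toy values are all hypothetical and chosen only for illustration.

```python
from typing import List

Token = List[float]


def concat_features(visual_tokens: List[Token], geometry_tokens: List[Token]) -> List[Token]:
    """Naive fusion: append each geometry feature vector to the matching visual one.

    Hypothetical helper for illustration only; it is not taken from the paper.
    """
    if len(visual_tokens) != len(geometry_tokens):
        raise ValueError("expected one geometry token per visual token")
    # List concatenation along the feature axis; nothing is learned here.
    return [v + g for v, g in zip(visual_tokens, geometry_tokens)]


# Toy inputs (assumed): 2 patch tokens, 3-dim visual and 2-dim geometry features.
visual = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
geometry = [[1.0, 2.0], [3.0, 4.0]]

fused = concat_features(visual, geometry)
print(len(fused), len(fused[0]))  # prints "2 5"
```

The fused tokens carry the geometric features, but nothing in this step tells a downstream model how to use them, which is the kind of suboptimal structural utilization the abstract attributes to such pipelines.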

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis