MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation

Changli Wu; Haodong Wang; Jiayi Ji; Yutian Yao; Chunsai Du; Jihua Kang; Yanwei Fu; Liujuan Cao

arXiv:2601.06874 · cs.CV · April 1, 2026

TL;DR

This paper introduces MVGGT, an end-to-end multimodal transformer for multiview 3D referring expression segmentation. It operates directly on a few sparse RGB views, avoiding the dense, high-quality point clouds that most existing methods require.

Contribution

The paper proposes MVGGT, a dual-branch framework that integrates language information into sparse-view geometric reasoning, and introduces PVSO to improve training stability in sparse-view 3D segmentation.

Findings

01. MVGGT achieves state-of-the-art accuracy on the MVRefer benchmark.

02. MVGGT runs faster than traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation.

03. PVSO stabilizes training under sparse 3D signals.

Abstract

Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier,…

Code & Models

Models

sosppxo/mvggt
model · ♡ 1

Datasets

sosppxo/MVRefer
dataset · 125 dl