IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
Hao Li, Zhengyu Zou, Fangfu Liu, Xuanyang Zhang, Fangzhou Hong, Yukang Cao, Yushi Lan, Manyuan Zhang, Gang Yu, Dingwen Zhang, Ziwei Liu

TL;DR
This paper introduces IGGT, a transformer model that unifies 3D geometric reconstruction and instance-level understanding from 2D images, improving scene coherence and object differentiation.
Contribution
The paper presents a novel end-to-end transformer architecture with a 3D-Consistent Contrastive Learning strategy and a new large-scale dataset for integrated 3D reconstruction and instance understanding.
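To make the idea of a 3D-consistent contrastive objective concrete, here is a minimal sketch assuming a generic pull/push formulation. This summary does not give the paper's actual loss, so this is not the authors' method. The function name `instance_contrastive_loss`, the `margin` parameter, and the use of Euclidean distance are all illustrative assumptions. The only idea taken from the text is that pixels from different views that belong to the same 3D instance should have similar features, while pixels from different instances should not.

```python
import numpy as np

def instance_contrastive_loss(features, instance_ids, margin=1.0):
    """Hypothetical pull/push loss over per-pixel embeddings.

    features:     (N, D) embeddings pooled from all input views.
    instance_ids: (N,) integers; equal IDs mean the same 3D instance,
                  regardless of which view the pixel came from.
    margin:       assumed hyperparameter; negatives farther apart than
                  this contribute nothing to the loss.
    """
    features = np.asarray(features, dtype=np.float64)
    instance_ids = np.asarray(instance_ids)

    # Pairwise Euclidean distances between all embeddings.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    same = instance_ids[:, None] == instance_ids[None, :]
    off_diag = ~np.eye(len(features), dtype=bool)
    pos = same & off_diag   # same instance, distinct pixels
    neg = ~same             # different instances

    # Positive pairs are penalised by squared distance; negative pairs
    # are penalised only when they sit closer than the margin.
    pull = (dist[pos] ** 2).mean() if pos.any() else 0.0
    push = (np.maximum(0.0, margin - dist[neg]) ** 2).mean() if neg.any() else 0.0
    return float(pull + push)
```

With two tight, well-separated clusters, the loss is small when the IDs follow the clusters and large when the IDs are mixed across them, which is the behaviour a view-consistent instance label is meant to reward.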
Findings
Effective 3D scene reconstruction from 2D inputs.
Explicit object instance differentiation in 3D scenes.
Improved generalization in downstream 3D understanding tasks.
Abstract
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose Instance-Grounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial…
Peer Reviews
Decision·ICLR 2026 Poster
The paper addresses an important and timely problem—jointly performing 3D reconstruction and instance-level scene understanding. This capability is highly relevant for downstream applications in robotics, AR/VR, and general 3D scene analysis, where both geometry and object-level understanding are required. - Producing instance masks alongside reconstruction is meaningful for many robotic and perception tasks, as most downstream reasoning and manipulation systems operate at the instance level ra
- An ablation study on the 3D-Consistent Contrastive Loss is missing. Since this loss is central to the paper’s claim of mutual learning between geometry and instance representation, its explicit contribution should be quantified through targeted ablations. - The paper lacks a standard class-agnostic instance segmentation evaluation. Common metrics such as AP, AP50, and AP25 on ground-truth instance labels, widely used in the 3D instance segmentation literature (e.g., SAI3D, SamPart3D, OpenIns3D
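For readers unfamiliar with the metrics named in this review, the following is a simplified sketch of class-agnostic average precision at a single IoU threshold (0.5 for AP50, 0.25 for AP25). It assumes greedy one-to-one matching in descending score order and an uninterpolated precision average. Official benchmark scripts differ in details, and the helper names `mask_iou` and `average_precision` are hypothetical.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def average_precision(pred_masks, scores, gt_masks, iou_thresh=0.5):
    """Simplified class-agnostic AP at one IoU threshold."""
    if len(pred_masks) == 0 or len(gt_masks) == 0:
        return 0.0

    # Visit predictions from most to least confident.
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    matched = set()
    tp = np.zeros(len(order))

    for rank, p in enumerate(order):
        # Pick the best still-unmatched ground-truth instance.
        best_iou, best_g = 0.0, -1
        for g, gt in enumerate(gt_masks):
            if g in matched:
                continue
            iou = mask_iou(pred_masks[p], gt)
            if iou > best_iou:
                best_iou, best_g = iou, g
        if best_g >= 0 and best_iou >= iou_thresh:
            matched.add(best_g)
            tp[rank] = 1.0

    # Precision at each rank, averaged over the ranks that are true
    # positives and normalised by the number of ground-truth instances.
    precision = np.cumsum(tp) / (np.arange(len(order)) + 1)
    return float((precision * tp).sum() / len(gt_masks))
```

Because no class labels enter the matching, a prediction counts as correct purely on mask overlap. Lowering the threshold from 0.5 to 0.25 can only keep or raise the number of matched predictions.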
1. I think the paper offers incremental improvements over VGGT by unifying spatial reconstruction and instance-level contextual understanding, which is interesting. 2. The writing and presentation are good, and both quantitative and qualitative experiments have been conducted in sufficient detail.
1. The discussion of related work is somewhat insufficient. Several recent and more advanced 3D Multimodal LLMs—such as **Inst3D-LMM (CVPR 2025)** and **Chat-Scene (NeurIPS 2024)**—are not discussed. A more comprehensive review would strengthen the paper. 2. Do the authors consider evaluating IGGT on more general 3D scene understanding tasks to further demonstrate its effectiveness—for example, 3D visual grounding on ScanRefer or Multi3DRefer, and 3D VQA on ScanQA?
1. The dataset contribution is essential to the 3D understanding community. InsScene-15K is a practical, geometry-aware, multi-view dataset with view-consistent instance IDs that scales via SAM-based curation, supports tracking/OVS/reconstruction in one place, and cleanly interfaces with VLM/LMM grounding, filling a real gap for unified 3D perception. 2. Following VGGT in the geometry representation, IGGT uses a multi-view image encoder to unify the visual token representations. Additio
1. Some typos: (1) line 198: " 3) a 3D consistent supervision to" ? To what? Super curious about the following sentence. (2) In Fig. 2, I think you are referring to InsScene-15K, but the pie chart and the right column say ScenePart-15K. (3) Scannet --> ScanNet; Scannet++ --> ScanNet++. 2. There are some new baselines for this kind of unified model for semantic understanding. SceneSplat [1,2] trains a large model that takes Gaussian Splatting in and outputs open-vocabulary semantics, and got a great performa
