VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu

TL;DR
VGGT-Det introduces a novel sensor-geometry-free framework for indoor 3D object detection that leverages VGGT priors and transformer-based components to achieve state-of-the-art results without requiring camera pose or depth data.
Contribution
The paper presents VGGT-Det, the first Sensor-Geometry-Free (SG-Free) multi-view indoor 3D detection framework. It integrates VGGT priors into a transformer pipeline and introduces new attention-guided query generation and feature aggregation modules.
Findings
VGGT-Det outperforms previous methods by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively.
The proposed modules effectively leverage semantic and geometric priors from VGGT.
The framework operates without sensor-provided geometric inputs, enabling practical deployment.
Abstract
Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce…
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Neural Network Applications · 3D Shape Modeling and Analysis
