DriveVGGT: Calibration-Constrained Visual Geometry Transformers for Multi-Camera Autonomous Driving
Xiaosong Jia, Yanhao Liu, Yu Hong, Renqiu Xia, Junqi You, Bin Sun, Zhihui Hao, Junchi Yan

TL;DR
DriveVGGT introduces a scale-aware, domain-specific transformer framework for multi-camera autonomous driving, explicitly integrating priors like sparse overlap, calibration, and rigid extrinsics to improve depth and pose estimation.
Contribution
The paper proposes DriveVGGT, a reconstruction framework that incorporates three domain priors (sparse spatial overlap, calibrated geometric constraints, and rigid extrinsic constancy) through specialized modules, improving efficiency and accuracy over existing methods.
Findings
Reduces inference time by 49.3% on autonomous driving (AD) datasets.
Improves depth and pose estimation accuracy compared to vanilla VGGT.
Outperforms recent state-of-the-art methods in long-sequence scenarios.
Abstract
Feed-forward reconstruction has progressed rapidly, with the Visual Geometry Grounded Transformer (VGGT) being a notable baseline. However, directly applying VGGT to autonomous driving (AD) fails to capture three domain-specific priors: (i) Sparse Spatial Overlap: the overlap among multi-view cameras is minimal due to coverage requirements under budget control, which renders global attention among all images inefficient; (ii) Calibrated Geometric Constraints: the absolute distance among cameras is generally accessible for AD data through the calibration process performed before driving, but standard VGGT is unable to directly utilize such information for absolute-scale scene reconstruction; (iii) Rigid Extrinsic Constancy: relative poses of multi-view cameras are approximately static, i.e., the ego-motion is the same for all cameras. To bridge these gaps, we propose DriveVGGT, a…
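Priors (ii) and (iii) can be illustrated with a small NumPy sketch. This is not DriveVGGT's implementation: the function names, the toy two-camera rig, and the least-squares scale fit are all assumptions made for illustration. It shows that when every camera is rigidly mounted on the ego vehicle, the relative pose between cameras stays constant as the vehicle moves, and that calibrated camera-to-camera distances are enough to recover a metric scale for an up-to-scale reconstruction.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (radians) and a 3-vector."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def camera_poses(ego_pose, cam_to_ego):
    """World pose of each camera: one shared ego pose composed with fixed extrinsics."""
    return [ego_pose @ E for E in cam_to_ego]

def relative_pose(T_a, T_b):
    """Pose of camera b expressed in the frame of camera a."""
    return np.linalg.inv(T_a) @ T_b

def metric_scale(pred_centers, calib_centers):
    """Least-squares scale s minimizing sum((s * pred_baseline - calib_baseline)^2).

    Baselines are distances from camera 0 to every other camera.
    """
    pred = np.linalg.norm(pred_centers[1:] - pred_centers[0], axis=1)
    true = np.linalg.norm(calib_centers[1:] - calib_centers[0], axis=1)
    return float(np.sum(pred * true) / np.sum(pred * pred))

# Hypothetical rig: a front camera and a left-facing camera (values are made up).
cam_to_ego = [
    make_pose(0.0, [1.5, 0.0, 1.6]),
    make_pose(np.pi / 2, [1.0, 0.5, 1.6]),
]

# Two ego poses along a toy trajectory.
ego_t0 = make_pose(0.1, [0.0, 0.0, 0.0])
ego_t1 = make_pose(0.3, [2.0, 0.4, 0.0])

cams_t0 = camera_poses(ego_t0, cam_to_ego)
cams_t1 = camera_poses(ego_t1, cam_to_ego)

# Rigid extrinsic constancy: the camera-to-camera transform is the same at both times.
rel_t0 = relative_pose(cams_t0[0], cams_t0[1])
rel_t1 = relative_pose(cams_t1[0], cams_t1[1])
print("relative pose constant:", np.allclose(rel_t0, rel_t1))

# Calibrated constraint: rescale an up-to-scale prediction using known baselines.
calib_centers = np.array([E[:3, 3] for E in cam_to_ego])
pred_centers = calib_centers / 4.0  # pretend the prediction is 4x too small
print("recovered scale:", metric_scale(pred_centers, calib_centers))
```

Individual camera poses change between the two timestamps, but their relative transform does not, so a model only needs to estimate one ego-motion per timestep rather than one independent pose per camera.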
