DVGT: Driving Visual Geometry Transformer

Sicheng Zuo; Zixun Xie; Wenzhao Zheng; Shaoqing Xu; Fang Li; Shengyin Jiang; Long Chen; Zhi-Xin Yang; Jiwen Lu

arXiv:2512.16919·cs.CV·December 19, 2025

TL;DR

DVGT is a novel transformer-based model that reconstructs dense 3D scene geometry directly from multi-view visual inputs in autonomous driving, without relying on explicit camera calibration or external sensors.

Contribution

The paper introduces a flexible, camera-agnostic transformer architecture for dense 3D reconstruction from multi-view images in driving scenarios, and reports that it outperforms existing methods.

Findings

01. Outperforms existing models on multiple driving datasets.

02. Does not require explicit camera parameters or external sensors.

03. Successfully reconstructs metric-scaled 3D geometry from image sequences.

Abstract

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there is still no driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose the Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and then employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate frame of the first frame, along with the ego pose for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors,…
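
The alternating attention pattern described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the tensor layout `(T, V, P, C)` (frames, views, patch tokens per image, channels), the parameter-free single-head attention, the residual connections, and the exact token grouping for each attention type are all assumptions made for illustration. The names `self_attention` and `alternating_block` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Parameter-free single-head self-attention (illustrative only).

    tokens: array of shape (..., N, C). Each group of N tokens attends
    only to itself; leading axes are treated as independent batches.
    """
    c = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(c)
    return softmax(scores, axis=-1) @ tokens


def alternating_block(x):
    """One hypothetical block of alternating attention.

    x: (T, V, P, C) = frames, views per frame, patch tokens, channels.
    The grouping below is an assumption, not taken from the paper.
    """
    T, V, P, C = x.shape

    # Intra-view local attention: each image attends over its own P tokens.
    x = x + self_attention(x)

    # Cross-view spatial attention: within one frame, the tokens of all
    # V views are flattened into one sequence and attend jointly.
    y = x.reshape(T, V * P, C)
    x = x + self_attention(y).reshape(T, V, P, C)

    # Cross-frame temporal attention: for one view, the tokens of all
    # T frames are flattened into one sequence and attend jointly.
    y = x.transpose(1, 0, 2, 3).reshape(V, T * P, C)
    x = x + self_attention(y).reshape(V, T, P, C).transpose(1, 0, 2, 3)
    return x


# Usage: 2 frames, 3 cameras, 4 patch tokens per image, 8 channels.
features = np.random.default_rng(0).normal(size=(2, 3, 4, 8))
out = alternating_block(features)
print(out.shape)  # (2, 3, 4, 8)
```

The point of the sketch is that the three attention types differ only in which axes are flattened into the token sequence. Restricting each step to one axis keeps every attention matrix much smaller than full attention over all `T * V * P` tokens at once.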

Taxonomy

Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis