DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Dong Zhuo; Wenzhao Zheng; Sicheng Zuo; Siming Yan; Lu Hou; Jie Zhou; Jiwen Lu

arXiv:2603.19219 · cs.CV · March 20, 2026

TL;DR

DriveTok introduces a novel 3D scene tokenizer for autonomous driving that efficiently unifies multi-view reconstruction and understanding by integrating semantic, geometric, and textural information into scene tokens.

Contribution

The paper proposes a 3D driving scene tokenizer that builds scene tokens from semantically rich vision-foundation-model features using 3D deformable cross-attention, so that a single token set supports both multi-view reconstruction and scene understanding.

Findings

01. Outperforms existing methods on nuScenes in reconstruction and segmentation tasks

02. Achieves efficient multi-view 3D scene understanding with unified tokens

03. Demonstrates strong performance in depth and occupancy prediction

Abstract

With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly…
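The encode/decode data flow described in the abstract can be illustrated with a minimal shape-level sketch. This is not the authors' implementation: plain dense single-head cross-attention stands in for both the 3D deformable cross-attention and the multi-view transformer decoder, random arrays stand in for vision-foundation-model features and learned weights, and all sizes and names (`V`, `N`, `C`, `T`, `K`, `scene_queries`, `view_queries`, `heads`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(queries, keys):
    # Scaled dot-product attention weights: (Q, C) x (K, C) -> (Q, K).
    scale = 1.0 / np.sqrt(queries.shape[-1])
    return softmax(queries @ keys.T * scale, axis=-1)

def cross_attention(queries, keys_values):
    # Dense single-head cross-attention (a stand-in, NOT deformable attention).
    return attention_weights(queries, keys_values) @ keys_values

# Hypothetical sizes: views, patches per view, channels, scene tokens, classes.
V, N, C, T, K = 6, 32, 16, 8, 5

# Stand-in for per-view features from a vision foundation model.
view_feats = rng.standard_normal((V, N, C))

# Encode: scene-token queries attend over the features of all views at once,
# so one shared token set summarizes the whole multi-view scene.
scene_queries = rng.standard_normal((T, C))
scene_tokens = cross_attention(scene_queries, view_feats.reshape(V * N, C))

# Decode: per-view queries read the scene tokens back into per-view features.
view_queries = rng.standard_normal((V * N, C))
decoded = cross_attention(view_queries, scene_tokens).reshape(V, N, C)

# Multiple linear heads map decoded features to RGB, depth, and semantics.
heads = {
    "rgb": rng.standard_normal((C, 3)),
    "depth": rng.standard_normal((C, 1)),
    "semantic": rng.standard_normal((C, K)),
}
outputs = {name: decoded @ weight for name, weight in heads.items()}

for name, value in outputs.items():
    print(name, value.shape)
```

The point of the sketch is the bottleneck: all `V * N` per-view features pass through only `T` scene tokens, which is what makes a shared token set a compact interface for multi-view input.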

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications