DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding
Dong Zhuo, Wenzhao Zheng, Sicheng Zuo, Siming Yan, Lu Hou, Jie Zhou, Jiwen Lu

TL;DR
DriveTok introduces a novel 3D scene tokenizer for autonomous driving that efficiently unifies multi-view reconstruction and understanding by integrating semantic, geometric, and textural information into scene tokens.
Contribution
The paper proposes a new 3D driving scene tokenizer that improves multi-view scene understanding and reconstruction by combining semantically rich features from vision foundation models with 3D deformable cross-attention.
Findings
Outperforms existing methods on nuScenes in reconstruction and segmentation tasks
Achieves efficient multi-view 3D scene understanding with unified tokens
Demonstrates strong performance in depth and occupancy prediction
Abstract
With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly…
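To make the feature-to-token step more concrete, here is a minimal NumPy sketch of the geometric idea behind 3D deformable cross-attention: each scene token owns a 3D reference point, that point is projected into every camera view, and the image features found there are aggregated into the token. This is not the authors' implementation. The function name `lift_features_to_tokens`, the arguments `proj` and `ref_points`, and the pinhole projection convention are all assumptions made for illustration. The learned sampling offsets and attention weights of real deformable attention are replaced here by nearest-pixel sampling and a plain average over the views that see the point.

```python
import numpy as np

def lift_features_to_tokens(feats, proj, ref_points):
    """Aggregate multi-view image features into scene tokens (hypothetical helper).

    feats:      (V, H, W, C) per-view feature maps
    proj:       (V, 3, 4) assumed pinhole projection matrices (world -> pixel)
    ref_points: (N, 3) one 3D reference point per scene token
    Returns (N, C) tokens and (N,) counts of views that saw each point.
    """
    V, H, W, C = feats.shape
    N = ref_points.shape[0]
    homog = np.concatenate([ref_points, np.ones((N, 1))], axis=1)  # (N, 4)
    tokens = np.zeros((N, C))
    hits = np.zeros(N)
    for v in range(V):
        cam = homog @ proj[v].T                      # (N, 3): [u*d, v*d, d]
        depth = cam[:, 2]
        in_front = depth > 1e-6                      # drop points behind the camera
        safe_depth = np.where(in_front, depth, 1.0)  # avoid division by zero
        px = np.rint(cam[:, 0] / safe_depth).astype(int)
        py = np.rint(cam[:, 1] / safe_depth).astype(int)
        valid = in_front & (px >= 0) & (px < W) & (py >= 0) & (py < H)
        # Nearest-pixel sampling stands in for learned offsets (simplification).
        tokens[valid] += feats[v, py[valid], px[valid]]
        hits[valid] += 1
    seen = hits > 0
    # Uniform average over views stands in for learned attention weights.
    tokens[seen] /= hits[seen][:, None]
    return tokens, hits

# Toy usage: one 4x4 view whose feature at pixel (x, y) is [x, y].
ys, xs = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
feats = np.stack([xs, ys], axis=-1)[None].astype(float)   # (1, 4, 4, 2)
proj = np.array([[[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]]])
ref_points = np.array([[2.0, 3.0, 1.0],    # lands on pixel (2, 3)
                       [4.0, 6.0, 2.0],    # same pixel, twice as far away
                       [0.0, 0.0, -1.0],   # behind the camera
                       [10.0, 0.0, 1.0]])  # outside the image
tokens, hits = lift_features_to_tokens(feats, proj, ref_points)
```

Tokens whose reference point is not visible in any view stay at zero in this sketch. A full tokenizer would typically handle them with learned query embeddings or additional attention layers, which the abstract does not detail.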
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Neural Network Applications
