Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?
Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang, Xiangyu Zhang

TL;DR
This paper proposes a 3D-tokenized LLM approach called Atlas, which leverages 3D priors for improved perception and planning in autonomous driving, outperforming traditional 2D-tokenized methods.
Contribution
It introduces a novel 3D tokenization method using DETR-style perceptrons, connecting LLMs with 3D environment understanding for autonomous driving.
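As a rough illustration of what "connecting" a DETR-style 3D perceptron to an LLM can look like, the sketch below projects per-object query features and their 3D reference points into a shared embedding space and prepends the result to text token embeddings. This is a minimal sketch under stated assumptions, not the Atlas implementation: the function names, the use of plain linear projections, the additive position term, and all dimensions are hypothetical and do not come from the paper summary above.

```python
import random


def linear(x, w):
    """Multiply a row vector x (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def queries_to_llm_tokens(queries, ref_points, w_feat, w_pos):
    """Map DETR-style queries to LLM-sized tokens (hypothetical design).

    queries:    list of query feature vectors, each of length d_query
    ref_points: list of (x, y, z) reference points, one per query
    w_feat:     d_query x d_llm projection for query features (assumed)
    w_pos:      3 x d_llm projection for 3D positions (assumed)
    """
    tokens = []
    for q, p in zip(queries, ref_points):
        feat = linear(q, w_feat)  # content part of the token
        pos = linear(p, w_pos)    # 3D position part of the token
        tokens.append([f + g for f, g in zip(feat, pos)])
    return tokens


def random_matrix(rows, cols, rng, scale=0.02):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_queries, d_query, d_llm = 4, 8, 16

# Stand-ins for the outputs of a 3D perception head.
queries = random_matrix(num_queries, d_query, rng, scale=1.0)
ref_points = [[rng.uniform(-50.0, 50.0) for _ in range(3)] for _ in range(num_queries)]

w_feat = random_matrix(d_query, d_llm, rng)
w_pos = random_matrix(3, d_llm, rng)

vision_tokens = queries_to_llm_tokens(queries, ref_points, w_feat, w_pos)
text_tokens = random_matrix(5, d_llm, rng, scale=1.0)  # stand-in for embedded prompt tokens

# The LLM would consume the 3D tokens followed by the text tokens.
llm_input = vision_tokens + text_tokens
print(len(llm_input), len(llm_input[0]))
```

The point of the sketch is only that each token handed to the LLM carries an explicit 3D location alongside its content, which is the property a 2D patch tokenizer does not provide by construction.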
Findings
Atlas outperforms 2D-tokenized LLMs in 3D detection
Atlas improves ego planning accuracy
3D tokenization enhances reliability in autonomous driving
Abstract
Rapid advancements in Autonomous Driving (AD) tasks have driven a significant shift toward end-to-end approaches, particularly the use of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to combine 2D vision tokenizers with a large language model (LLM) for ego-car planning, and therefore lack the 3D geometric priors that are a cornerstone of reliable planning. Naturally, this observation raises a critical concern: Can a 2D-tokenized LLM accurately perceive the 3D environment? Our evaluation of current VLM-based methods across 3D object detection, vectorized map construction, and environmental captioning suggests that the answer is, unfortunately, NO. In other words, a 2D-tokenized LLM fails to provide reliable autonomous driving. In response, we introduce DETR-style 3D…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The paper introduces Atlas, a novel 3D-tokenized LLM framework that integrates 3D geometric priors, addressing the limitations of existing 2D-tokenized VLMs in autonomous driving.
- Through comprehensive evaluations on the nuScenes dataset, Atlas demonstrates superior performance in 3D object detection, lane detection, and ego-car planning, validating its effectiveness over traditional methods.
- The paper clearly explains the need for 3D tokenization in autonomous driving, providing structu

- Using 3D perception as tokens will directly benefit from the perception labels, significantly increase the model size, and thus make it **unfair** to compare with those VLMs with images as inputs.
- The paper uses StreamPETR as the 3D perception backbone. However, from Table 1 joint training with LLMs significantly harms the perception performance, which indicates the ineffectiveness of the proposed method.
- Table 2 also shows a performance degradation in lane detection.
- Open-loop plann

- The paper introduces a 3D-tokenized framework that effectively bridges the gap between 2D perception in current VLMs and the 3D requirements of autonomous driving. This integration is both novel and practical for enhancing perception and planning tasks.
- The paper conducts thorough experiments across multiple perception and planning tasks using the nuScenes dataset, with Atlas outperforming previous VLM-based approaches in 3D perception and open-loop planning tasks.
- The chain-of-thought de

- While this work introduces advancements in spatial perception by incorporating 3D tokenizers, it would benefit from a comparison with recent open-source studies that explore large language models’ spatial understanding in driving scenarios, such as [1].
- The impressive performance in open-loop planning is acknowledged, yet the detailed aspects contributing to this strength remain unclear. Further explanation of the performance-driving factors would enhance the clarity and reliability of these

1. The paper is well-written and easy to understand.
2. The work highlights the shortcomings of 2D-tokenized LLMs and demonstrates how a 3D tokenizer can enhance performance.

1. About the framework design: The necessity of connecting a 3D tokenizer to an LLM is questionable. A baseline experiment could clarify this by showing whether a simple combination of a 3D tokenizer and MLP could achieve similar planning results. The main advantage of incorporating an LLM seems to be interpretability, as LLMs may generate explanations, but whether they can control the trajectory as effectively remains unclear. Additionally, the LLM introduces limitations, such as reduced speed

1. The exploration of VLM-based methods in autonomous driving is interesting. This paper highlights the significant differences in perception between task-specific models and 2D tokenized VLM-based approaches.
2. The paper combines DETR-style 3D perceptrons with VLMs, leveraging 3D priors for improved depth perception and supporting various image types and temporal modeling.
3. The evaluation on the nuScenes dataset demonstrates enhancements in 3D detection and planning tasks, showcasing the r

1. Overclaiming: The paper argues that existing 2D VLMs struggle in achieving 3D perception, but it has been shown that by pre-training a 2D tokenizer, 2D VLMs can outperform traditional 3D detection methods [1]. The authors are encouraged to compare the performance curves of the 2D and 3D tokenizers as the amount of training data grows. Additionally, other works [2] utilize 3D encoders like depth maps and extrinsics for 3D scene understanding. How does the DETR-style tokenizer compare to or improve up
Taxonomy
Topics: Autonomous Vehicle Technology and Safety
