TL;DR
This paper introduces OneDrive, a unified vision-language-action model for autonomous driving that leverages a single transformer decoder to handle multiple tasks, achieving state-of-the-art results and efficient inference.
Contribution
The work presents a unified framework that handles heterogeneous driving tasks inside a pretrained VLM with a single causal decoder, in place of the separate or cascaded decoders that existing systems typically use.
Findings
Achieves 0.28 L2 and 0.18 collision rate on nuScenes.
Attains 86.8 PDMS on NAVSIM.
Reduces inference latency by approximately 40%.
Abstract
Vision-Language Models (VLMs) excel at autoregressive text generation, yet end-to-end autonomous driving requires multi-task learning with structured outputs and heterogeneous decoding behaviors, such as autoregressive language generation, parallel object detection, and trajectory regression. To accommodate these differences, existing systems typically introduce separate or cascaded decoders, resulting in architectural fragmentation and limited backbone reuse. In this work, we present a unified autonomous driving framework built upon a pretrained VLM, where heterogeneous decoding behaviors are reconciled within a single transformer decoder. We demonstrate that pretrained VLM attention exhibits strong transferability beyond pure language modeling. By organizing visual and structured query tokens within a single causal decoder, structured queries can naturally condition on visual context…
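The conditioning described in the abstract can be illustrated with a small sketch. It assumes a sequence layout of visual tokens followed by structured query tokens and a plain lower-triangular causal mask; the excerpt does not state the paper's actual token ordering, mask, or dimensions, so the names `num_visual`, `num_query`, and `dim` and the random embeddings are purely hypothetical. Under these assumptions, every query position can attend to every visual token, while visual tokens never attend to queries.

```python
import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # block disallowed pairs
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Hypothetical sizes, chosen only for illustration.
num_visual, num_query, dim = 6, 3, 8
seq_len = num_visual + num_query

rng = np.random.default_rng(0)
# Assumed layout: [visual tokens | structured query tokens].
tokens = rng.normal(size=(seq_len, dim))

mask = causal_mask(seq_len)
out, weights = masked_attention(tokens, tokens, tokens, mask)

# Query rows see all visual columns; visual rows see no query columns.
queries_see_visual = bool(mask[num_visual:, :num_visual].all())
visual_sees_queries = bool(mask[:num_visual, num_visual:].any())
print(queries_see_visual, visual_sees_queries)
```

In this layout all query outputs are read from one forward pass, which is one way parallel detection or trajectory queries could coexist with a causal decoder. Note that under a plain causal mask each query also attends to the queries before it; whether the paper keeps or modifies that behavior is not stated in the excerpt.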
