Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, Xinggang Wang

TL;DR
Senna integrates large vision-language models with end-to-end autonomous driving to improve planning accuracy and safety, leveraging scene understanding and reasoning for better decision-making.
Contribution
This work introduces Senna, a novel system combining LVLMs with end-to-end models, featuring decoupled planning and trajectory prediction, and a multi-stage training strategy for autonomous driving.
Findings
Achieves state-of-the-art planning performance on two datasets.
Reduces average planning error by 27.12% after pre-training.
Decreases collision rate by 33.33% with large-scale pre-training.
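The two percentages above read as relative reductions against a baseline without pre-training. A minimal sketch of that arithmetic follows; the helper name `relative_reduction` and the input values are hypothetical illustrations, not numbers reported for Senna.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the percentage by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# Hypothetical values only: a collision rate falling from 0.30% to 0.20%
# is a relative reduction of about one third.
print(round(relative_reduction(0.30, 0.20), 2))
```

A relative reduction says nothing about the absolute values, so the same percentage can correspond to very different baselines.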
Abstract
End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data but still struggles in complex, rare scenarios due to limited commonsense. In contrast, Large Vision-Language Models (LVLMs) excel in scene understanding and reasoning. The path forward lies in merging the strengths of both approaches. Previous methods using LVLMs to predict trajectories or control signals yield suboptimal results, as LVLMs are not well-suited for precise numerical predictions. This paper presents Senna, an autonomous driving system combining an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction. Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM utilizes a multi-image encoding approach and multi-view prompts for efficient scene…
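The decoupling described in the abstract can be illustrated with a toy sketch: a language-level decision is parsed into a small discrete vocabulary, and a separate component turns that decision into numeric waypoints. Everything here is an assumption made for illustration: the `Lateral`/`Longitudinal` vocabulary, the function names, and the kinematic rollout that stands in for a learned end-to-end model. It is not Senna's implementation.

```python
import math
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Lateral(Enum):
    LEFT = "left"
    STRAIGHT = "straight"
    RIGHT = "right"


class Longitudinal(Enum):
    ACCELERATE = "accelerate"
    KEEP = "keep"
    DECELERATE = "decelerate"


@dataclass(frozen=True)
class Decision:
    lateral: Lateral
    longitudinal: Longitudinal


def parse_decision(text: str) -> Decision:
    """Map a natural-language decision onto the (hypothetical) discrete vocabulary."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    lateral = next((m for m in Lateral if m.value in tokens), Lateral.STRAIGHT)
    longitudinal = next(
        (m for m in Longitudinal if m.value in tokens), Longitudinal.KEEP
    )
    return Decision(lateral, longitudinal)


def toy_trajectory(
    decision: Decision, speed_mps: float, steps: int = 6, dt: float = 0.5
) -> List[Tuple[float, float]]:
    """Stand-in for the low-level predictor: roll out a simple kinematic model.

    A real end-to-end model would predict waypoints from sensor features;
    the constants below are arbitrary illustration values.
    """
    accel = {
        Longitudinal.ACCELERATE: 1.0,
        Longitudinal.KEEP: 0.0,
        Longitudinal.DECELERATE: -1.0,
    }[decision.longitudinal]
    yaw_rate = {
        Lateral.LEFT: 0.1,
        Lateral.STRAIGHT: 0.0,
        Lateral.RIGHT: -0.1,
    }[decision.lateral]

    x = y = heading = 0.0
    v = speed_mps
    waypoints = []
    for _ in range(steps):
        v = max(0.0, v + accel * dt)  # speed never goes negative
        heading += yaw_rate * dt
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        waypoints.append((x, y))
    return waypoints


decision = parse_decision("Decelerate and turn left.")
print(decision)
print(toy_trajectory(decision, speed_mps=5.0))
```

The point of the split is that the language side only has to choose among a few discrete options, while the numeric side handles precise coordinates, which matches the abstract's observation that LVLMs are not well-suited for precise numerical predictions.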
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
