CarLLaVA: Vision language models for camera-only closed-loop driving

Katrin Renz; Long Chen; Ana-Maria Marcu; Jan Hünermann; Benoit Hanotte; Alice Karnsund; Jamie Shotton; Elahe Arani; Oleg Sinavski

arXiv:2406.10165·cs.CV·June 17, 2024·3 cites

TL;DR

CarLLaVA is a vision-language model for autonomous driving that achieves state-of-the-art performance using only camera input, with a novel semi-disentangled output representation and efficient training methods.

Contribution

The paper introduces CarLLaVA, a vision-language model that outperforms previous methods in camera-only closed-loop driving. Its semi-disentangled output predicts both a path and waypoints, using the path for lateral control and the waypoints for longitudinal control.

Findings

01 Achieved 1st place in the sensor track of the CARLA Autonomous Driving Challenge 2.0.

02 Outperformed the previous state of the art by 458%.

03 Demonstrated effective language commentary prediction alongside the driving output.

Abstract

In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, getting the advantages of the path for better lateral control and the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks 1st place in the sensor track of the CARLA Autonomous Driving…
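
To make the semi-disentangled output described in the abstract more concrete, here is a minimal sketch of how a downstream controller might consume the two predictions: the path for a lateral (steering) target and the time-spaced waypoints for a longitudinal (speed) target. This is not the authors' controller. The function names, the ego-frame convention (x forward, y left, in meters), the lookahead distance, and the waypoint time step `dt_s` are all assumptions made for illustration.

```python
import math

# Hypothetical helpers: names, units, and defaults are illustrative assumptions,
# not taken from the CarLLaVA report.

def steering_from_path(path, lookahead_m=3.0):
    """Heading angle (rad) toward the first path point at least lookahead_m away.

    `path` is a list of (x, y) points in the ego frame. The path carries only
    geometry, so it is used for lateral control alone.
    """
    for x, y in path:
        if math.hypot(x, y) >= lookahead_m:
            return math.atan2(y, x)
    # No point is far enough away: aim at the last path point instead.
    x, y = path[-1]
    return math.atan2(y, x)


def target_speed_from_waypoints(waypoints, dt_s=0.5):
    """Mean speed (m/s) implied by consecutive waypoints spaced dt_s apart in time.

    Waypoints encode where the vehicle should be at fixed time steps, so their
    spacing carries the longitudinal (speed) information.
    """
    if len(waypoints) < 2:
        return 0.0
    dists = [
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:])
    ]
    return sum(dists) / (len(dists) * dt_s)


def control_targets(path, waypoints, lookahead_m=3.0, dt_s=0.5):
    """Combine both outputs: steering from the path, speed from the waypoints."""
    return (
        steering_from_path(path, lookahead_m),
        target_speed_from_waypoints(waypoints, dt_s),
    )
```

For example, `control_targets([(1.0, 0.0), (2.0, 0.5), (3.0, 1.0)], [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)])` aims at the point `(3.0, 1.0)` for steering and derives a speed target from waypoints 2 m apart at the assumed 0.5 s step. In a real stack these targets would typically feed separate lateral and longitudinal controllers, such as PID loops.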

Code & Models

Repositories

RenzKa/simlingo
none

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Entropy Regularization · Proximal Policy Optimization · CARLA: An Open Urban Driving Simulator · LLaMA