LingoQA: Visual Question Answering for Autonomous Driving
Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, Elahe Arani, Oleg Sinavski

TL;DR
LingoQA introduces a large dataset and benchmark for visual question answering in autonomous driving, shows that current vision-language models lag well behind humans, and provides a learned classifier, Lingo-Judge, for evaluating answers.
Contribution
The paper presents a new dataset, benchmark, and evaluation tools for vision-language models in autonomous driving scenarios.
Findings
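State-of-the-art models score below human performance: GPT-4V answers 59.6% of the questions truthfully, compared to 96.6% for humans.
Lingo-Judge reaches a 0.95 Spearman correlation with human evaluations, higher than METEOR, BLEU, CIDEr, and GPT-4.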
Benchmark and dataset are publicly released for future research.
Abstract
We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios, and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.
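The abstract reports agreement between Lingo-Judge and human evaluation as a Spearman correlation coefficient. As a minimal sketch of what that statistic measures, the snippet below computes Spearman's rank correlation in plain Python. The score lists are hypothetical values made up for illustration, not data from the paper, and the helper names `average_ranks` and `spearman` are assumptions rather than part of any LingoQA release.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model scores (illustration only, not from the paper):
# one list from human raters, one from an automatic metric.
human_scores = [0.90, 0.40, 0.70, 0.20, 0.60]
metric_scores = [0.85, 0.45, 0.80, 0.10, 0.30]

rho = spearman(human_scores, metric_scores)
print(f"Spearman correlation: {rho:.2f}")
```

Because the statistic depends only on rank order, a metric scores highly when it orders the evaluated systems the same way humans do, even if its absolute values differ from the human scores.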
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Dense Connections
