LingoQA: Visual Question Answering for Autonomous Driving

Ana-Maria Marcu; Long Chen; Jan Hünermann; Alice Karnsund; Benoit Hanotte; Prajwal Chidananda; Saurabh Nair; Vijay Badrinarayanan; Alex Kendall; Jamie Shotton; Elahe Arani; Oleg Sinavski

arXiv:2312.14115·cs.RO·September 27, 2024·6 cites

Open Access · 2 Repos · 1 Model

TL;DR

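LingoQA introduces a large dataset and benchmark for visual question answering in autonomous driving. It shows that current vision-language models lag well behind humans (GPT-4V answers 59.6% of questions truthfully, versus 96.6% for humans) and provides Lingo-Judge, a learned truthfulness classifier, for evaluation.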

Contribution

The paper presents a new dataset (28K short video scenarios with 419K annotations), a benchmark, and a learned evaluation metric (Lingo-Judge) for vision-language models in autonomous driving scenarios.

Findings

01

State-of-the-art models score below human performance: GPT-4V answers 59.6% of questions truthfully, compared to 96.6% for humans.

02

Lingo-Judge achieves a 0.95 Spearman correlation with human evaluations, surpassing METEOR, BLEU, CIDEr, and GPT-4.

03

Benchmark and dataset are publicly released for future research.

Abstract

We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient with human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.
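
The abstract compares evaluation metrics by their Spearman correlation with human evaluations. As a rough illustration of what that statistic measures, the sketch below computes Spearman's rank correlation in plain Python (rank both lists, then take the Pearson correlation of the ranks). The score lists are hypothetical values invented for this example, not results from the paper, and the helper names `average_ranks`, `pearson`, and `spearman` are our own; this is not the authors' evaluation code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five systems (illustrative only, not from the paper):
# one list from an automatic metric, one from human raters.
metric_scores = [0.31, 0.42, 0.55, 0.58, 0.71]
human_scores = [0.35, 0.40, 0.61, 0.57, 0.80]

rho = spearman(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

Because Spearman's coefficient depends only on rank order, a metric can score highly even when its raw values sit on a different scale from the human ratings; what matters is whether it orders the systems the way humans do.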

Code & Models

Models

🤗 wayveai/Lingo-Judge · model · 6.4k downloads · ♡ 4

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Dense Connections