Reality Bites: Assessing the Realism of Driving Scenarios with Large Language Models
Jiahui Wu, Chengjie Lu, Aitor Arrieta, Tao Yue, Shaukat Ali

TL;DR
This paper empirically evaluates whether large language models can judge the realism of driving scenarios generated by autonomous driving testing techniques. Among the models compared, GPT was the most robust across the conditions tested.
Contribution
It introduces an empirical evaluation framework for assessing LLMs' effectiveness in judging driving scenario realism, highlighting GPT's superior robustness.
Findings
GPT achieved the highest robustness across scenarios
Weather and road conditions affect LLM performance
Mistral performed the worst in assessments
Abstract
Large Language Models (LLMs) are demonstrating outstanding potential for tasks such as text generation, summarization, and classification. Given that such models are trained on a humongous amount of online knowledge, we hypothesize that LLMs can assess whether driving scenarios generated by autonomous driving testing techniques are realistic, i.e., being aligned with real-world driving conditions. To test this hypothesis, we conducted an empirical evaluation to assess whether LLMs are effective and robust in performing the task. This reality check is an important step towards devising LLM-based autonomous driving testing techniques. For our empirical evaluation, we selected 64 realistic scenarios from DeepScenario, an open driving scenario dataset. Next, by introducing minor changes to them, we created 512 additional realistic scenarios, to form an overall dataset of 576 scenarios.…
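The dataset sizes quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the counts come from the abstract, while the function name `dataset_summary` and the idea that the 512 variants are spread evenly over the 64 originals are assumptions, since the visible abstract does not say how the variants were distributed.

```python
# Counts taken from the abstract.
ORIGINAL_SCENARIOS = 64
VARIANT_SCENARIOS = 512


def dataset_summary(originals: int, variants: int) -> dict:
    """Return the total dataset size and the variants-per-original ratio.

    The even split across originals is an assumption made for
    illustration; the abstract only reports the two counts and the total.
    """
    per_original, remainder = divmod(variants, originals)
    return {
        "total": originals + variants,
        "variants_per_original": per_original,
        "evenly_divisible": remainder == 0,
    }


summary = dataset_summary(ORIGINAL_SCENARIOS, VARIANT_SCENARIOS)
print(summary)
```

If the even split holds, each original scenario would contribute eight modified variants, which is consistent with the reported total of 576.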
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods
