DriveQA: Passing the Driving Knowledge Test

Maolin Wei; Wanzhou Liu; Eshed Ohn-Bar

arXiv:2508.21824 · cs.CV · September 1, 2025


TL;DR

DriveQA is a comprehensive benchmark for evaluating large language models' understanding of traffic rules and scenarios, revealing strengths and weaknesses in driving knowledge comprehension and aiding model improvement.

Contribution

This work introduces DriveQA, a novel open-source benchmark covering traffic regulations and scenarios for assessing and enhancing LLMs' driving knowledge understanding.

Findings

01. LLMs perform well on basic traffic rules but struggle with complex reasoning.

02. Fine-tuning on DriveQA improves accuracy in traffic sign recognition and intersection decisions.

03. Pretraining on DriveQA boosts downstream driving task performance and generalization.

Tables

Table 3: Performance of CoT Reasoning on DriveQA-T. The evaluation includes both off-the-shelf and fine-tuned models under two settings: with and without RAG.

Model	Size	BLEU-4 (w/o RAG)	BLEU-4 (w/ RAG)	ROUGE-L (w/o RAG)	ROUGE-L (w/ RAG)
Off-the-Shelf Models
Gemma-2 [75]	2B	0.1098	0.1704	0.2920	0.3387
Gemma-2 [75]	9B	0.3234	0.3116	0.4295	0.4276
Llama-3.1 [26]	8B	0.2573	0.2619	0.3270	0.3317
Llama-3.2 [26]	3B	0.2258	0.3140	0.3348	0.4024
Phi-3.5-mini [3]	3.8B	0.2437	0.2574	0.3616	0.3996
GPT-4o [59]	-	0.3905	0.3989	0.5354	0.5393
Fine-tuned Models
Gemma-2 [75]	2B	0.3623	0.2934	0.5058	0.4458
Gemma-2 [75]	9B	0.4112	0.4105	0.5420	0.5528
Llama-3.1 [26]	8B	0.3042	0.2946	0.4749	0.4750
Llama-3.2 [26]	3B	0.2131	0.1916	0.3853	0.3570
Phi-3.5-mini [3]	3.8B	0.2362	0.1891	0.4073	0.3476

Datasets

DriveQA/DriveQA_Dataset

Taxonomy

Topics: Human-Automation Interaction and Safety · Autonomous Vehicle Technology and Safety

Full text

Passing the Driving Knowledge Test

Maolin Wei 1,*, Wanzhou Liu 2,*, Eshed Ohn-Bar 1

1 Boston University, 2 Washington University in St. Louis. * Equally contributed.

Abstract

If a Large Language Model (LLM) were to take a driving knowledge test today, would it pass? Beyond standard spatial and visual question-answering (QA) tasks on current autonomous driving benchmarks, driving knowledge tests require a complete understanding of all traffic rules, signage, and right-of-way principles. To pass this test, human drivers must discern various edge cases that rarely appear in real-world datasets. In this work, we present DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. Through our experiments using DriveQA, we show that (1) state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios, traffic sign variations, and spatial layouts, (2) fine-tuning on DriveQA improves accuracy across multiple categories, particularly in regulatory sign recognition and intersection decision-making, (3) controlled variations in DriveQA-V provide insights into model sensitivity to environmental factors such as lighting, perspective, distance, and weather conditions, and (4) pretraining on DriveQA enhances downstream driving task performance, leading to improved results on real-world datasets such as nuScenes and BDD, while also demonstrating that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks. Project page: https://driveqaiccv.github.io.

1 Introduction

Safe navigation in traffic requires not only recognizing and interpreting visual information but also reasoning over traffic rules and making decisions that align with regulations. To ensure that drivers develop these critical skills, they must pass a written knowledge test before receiving their license. This structured, multiple-choice assessment is designed to evaluate precise understanding of traffic laws, right-of-way rules, and complex driving scenarios [32, 56].

Driving tests are not merely procedural; they assess a driver’s ability to apply reasoning across a wide range of traffic conditions. While primarily textual, these tests may also include graphical illustrations to ground questions in real-world scenarios. Recent advances in Multimodal Large Language Models (MLLMs) [26, 75, 97, 47, 3] as general-purpose reasoning models provide an opportunity to explore a key question: how well do current vision-and-language models perform when faced with the same driving knowledge assessments? Even without targeted fine-tuning, MLLMs may inherit some traffic rule knowledge from their pretraining data (however, our findings indicate that both such knowledge and associated reasoning capabilities remain limited).

Researchers have been increasingly integrating MLLMs into autonomous driving systems [54, 53, 89, 28, 84, 14, 70, 11, 98, 77, 42, 51, 95, 39]. However, while these models are often tested on perception-focused benchmarks that emphasize spatial awareness and standard planning tasks (e.g., lane keeping, collision avoidance [44, 87, 78]), they are rarely evaluated for their ability to understand and comply with diverse traffic regulations, for example by reasoning about traffic rules, reacting safely to no-entry signs, or maintaining the speed limit. While most existing datasets focus narrowly on perception and basic trajectory planning, driving knowledge tests are designed to assess a broad spectrum of regulations, including rare traffic signs, difficult right-of-way cases, and edge-case rules that are essential for safe navigation but seldom appear in collected driving data. This highlights a critical gap in evaluating AI systems: while they may perform well on current benchmarks, their ability to reason over long-tail traffic rules and regulatory compliance remains understudied. There is also substantial anecdotal evidence suggesting that current commercial systems, e.g., Tesla’s Full Self-Driving [24, 36, 79, 25], often struggle with interpreting traffic rules.

To address this gap and enhance the evaluation of reasoning capabilities in both LLMs and MLLMs, we introduce a novel driving knowledge benchmark, DriveQA. Our dataset includes both text-only question-answers (QA) and aligned image-text (VQA) pairs. Thus, we enable the first thorough evaluation of vision-and-language model performance across broad driving tasks, from basic regulatory questions and signs to complex multimodal reasoning tasks. Our contributions are summarized as follows:

• We introduce DriveQA, a large-scale benchmark featuring both text-based (DriveQA-T) and vision-based (DriveQA-V) driving knowledge assessments. To ensure broad coverage of traffic regulations, right-of-way rules, and rare driving scenarios, we leverage synthetic procedural data generation with comprehensive traffic reasoning, controlled variations (e.g., sign placement and weather), and new 3D sign assets integrated into CARLA [21], as well as manually annotated real-world data from Mapillary [57]. DriveQA covers 19 question categories, 220 traffic signs, and 474K samples.

• We benchmark state-of-the-art LLMs and MLLMs on DriveQA to uncover that while these models perform well on basic traffic rules, they struggle with numerical precision, right-of-way reasoning, spatial awareness, and environmental sensitivity (e.g., time-of-day, perspective, and geometric layouts). Our findings suggest that MLLMs inherit limited traffic knowledge from pretraining and require fine-tuning for our task.

• We demonstrate the effectiveness of DriveQA pretraining; models trained on our text and purely synthetic data demonstrate improved performance across various real-world driving tasks [87, 88]. We show that pretraining on DriveQA improves the performance on both trajectory prediction and driving action reasoning tasks. This highlights its role in evaluating and enhancing multimodal reasoning, and as a step toward bridging theory and practice in embodied AI systems that can learn to make decisions in the real-world based on text or synthetic data.

2 Related Work

Based on our survey of MLLM-based studies and VQA benchmarks for autonomous driving below, we find prior work rarely addressed traffic rules, signage, and right-of-way principles within their driving knowledge assessments. Relevant related benchmarks are compared in Table 1.

Multimodal Large Language Models: Our study diagnoses multimodal reasoning capabilities in MLLMs [38, 17, 61, 2, 97, 78, 80, 95]. A typical MLLM architecture comprises three main modules: a pre-trained modality encoder, a pre-trained language model, and a modality projector that aligns them. The modality encoder processes non-textual inputs, such as images, transforming them into representations compatible with the language models. Vision Transformer (ViT) [23] is widely used to extract image features. For example, CLIP [65] leverages ViT as its visual encoder to transform images into feature representations that align with text through extensive pre-training on large-scale image-text pairs. The modality projector aligns encoder outputs with the language model, enabling integration of modality data with text. A common approach is to use a set of learnable query tokens to extract information in a query-driven manner [7], which has been employed by a variety of models [43, 92, 18, 40, 10, 90, 13, 91]. Additionally, methods may design MLPs to transform the high-dimensional input features into a unified representation [50, 73, 62, 2]. Our systematic study controlling for variations in QA category and image factors reveals limitations of current alignment mechanisms in supporting multimodal or spatial reasoning.

MLLM-based Driving Agents: While recent advancements have applied MLLMs to autonomous driving tasks, most focus on leveraging reasoning and language understanding capabilities to improve driving decisions in narrow tasks [12, 15, 16, 28, 53, 70, 82, 89, 98]. For instance, several vision-and-language agents for motion planning and decision-making have been proposed and evaluated on datasets such as nuScenes [6, 53, 89, 28, 4, 78, 51, 54, 42, 67]. The key hypothesis in such studies is that MLLMs can inherit general-purpose reasoning and knowledge from pretraining; however, our findings suggest that while they may grasp basic traffic concepts, their ability to apply traffic reasoning in driving-specific scenarios remains limited. Moreover, these works have not explicitly addressed MLLMs’ ability to comprehend diverse traffic rules and regulations, a critical requirement for safe driving.

Datasets for Autonomous Driving: Several real-world, synthetic, and VQA benchmarks for autonomous driving are currently being used to evaluate driving models, including KITTI [30, 45], Waymo Open [74], Argoverse [8, 83], and nuScenes [6]. However, few incorporate more than a handful of traffic rules. For example, researchers may evaluate collisions on nuScenes [6, 94, 34, 78, 20, 99, 93], yet such evaluations lack coverage and do not explicitly assess traffic sign or right-of-way reasoning. Crowdsourced benchmarks such as Mapillary [57], which we augment with VQA annotations, are broad but still lack long-tail events, motivating the use of synthetic benchmarks. Yet, prior simulation-based studies (e.g., [21, 1, 35, 66, 96, 71, 27, 68]) have only accounted for a handful of potential regulatory and safety violations. For instance, while CARLA [21] enables controllable and diverse data generation (e.g., perspectives, scenarios, weather), most traffic signs are missing in CARLA, a limitation addressed by our work. The development of MLLMs and their applications in autonomous driving has led to the emergence of driving vision-language datasets [64, 55, 72, 5, 78, 80, 69] specifically designed to support vision understanding and reasoning in complex driving scenarios. However, here as well, existing efforts focus on scene understanding, perception, and basic planning (e.g., collision avoidance, intersection boundary [44]), neglecting reasoning about traffic rules and regulations (e.g., reacting safely to no-entry signs or maintaining the speed limit), which is a foundational driving test for humans.

3 A Multimodal Driver Knowledge Test

In this section, we outline our scalable data collection and annotation process. Our dataset consists of QA pairs that cover essential aspects of real-world driving knowledge. As illustrated in Fig. 2, our dataset comprises two tasks: DriveQA-T, which consists of text-based QA pairs on general driving rules, and DriveQA-V, focusing on visual (image-based) QA related to traffic sign comprehension and right-of-way scenarios. We adhere as closely as possible to standard driving knowledge tests to ensure meaningful comparisons to human performance on these assessments, and generate a diverse set of multiple-choice questions. We note that there are commercial driver knowledge tests available [56, 76], however these are closed-source. To ensure in-depth analysis, we further provide reasoning for ground-truth answers on both tasks. This design is intended to provide a holistic and systematic analysis of both LLMs and MLLMs in decision-oriented tasks. Ultimately, our overarching goal is to enable novel mechanisms to teach MLLMs real-world tasks, e.g., through text descriptions or synthetic examples.

Text-based QA Dataset (DriveQA-T): Our DriveQA-T dataset contains a total of 26K QA pairs covering different general driving topics, including traffic lights, traffic signs, parking, regulation, and symbols (see our supplementary for full details on the categories). Each QA pair contains an explanation for the correct answer, which can be used to evaluate the reasoning capabilities of LLMs. We build DriveQA-T in three steps. First, we gather 51 official driver’s handbooks covering all 50 US states plus DC. Although our dataset is US-centric, it can inform the construction of additional international datasets in the future. Second, we generate questions automatically by prompting GPT-4o [59, 98] with the driver’s handbooks as context. Third, we conduct manual quality verification against the driver’s handbooks. Quality checks were performed in rounds, where each verifier went through the questions, and ambiguous or inconsistent cases were discarded. Additional details about this process can be found in the supplementary. We note that humans, once trained, can obtain 100% on our benchmark. We categorized the text data into 19 classes, grouped into five main categories, as shown in Fig. 3. A summarized description of the dataset is depicted in Fig. 4, showing a focus on traffic participants and intersections (e.g., right-of-way, yielding behaviors).

Multimodal Extension With DriveQA-V: Driver knowledge tests [56, 76] are primarily text-based, e.g., with a full description of objects and spatial layout information in text. However, certain questions, particularly those related to traffic signs and right-of-way, test understanding through graphical illustrations accompanying the text. DriveQA-V focuses on these two types of questions. To ensure comprehensive coverage through procedural variations (e.g., camera perspectives, time of day, weather, distance), images are collected with the open-source Unreal Engine-based CARLA simulator [21]. However, since CARLA was not originally designed with extensive traffic rule knowledge, e.g., traffic signs, we augment the simulation with additional 3D assets and automatic traffic rule scripts. Because generation is procedural and synthetic, we are able to collect full state information, such as camera perspective, distance from the ego vehicle, and sign type, in addition to aligned text-image VQA pairs. Specifically, we insert 220 US-based traffic sign models into the simulator maps and spawn an ego vehicle to collect sensor readings. For right-of-way questions, we identify intersections in the CARLA maps and randomly spawn vehicles on each side of the intersection. Each vehicle varies in color to facilitate identification in the questions.

4 Method

In this section, we describe our approach to evaluating models on our proposed dataset. The methodology includes question-type classification, model evaluation using Chain of Thought (CoT) [81] reasoning, Retrieval-Augmented Generation (RAG) [41] techniques, and model fine-tuning on the benchmark.

Question-Type Classification: To assess model performance on specific traffic rule categories, we divide the DriveQA-T questions into types, which provides a nuanced understanding of how well models generalize across various traffic contexts. Specifically, we apply hierarchical clustering [58] to organize questions into semantically coherent groups, ensuring that similar questions are grouped together based on their thematic content. We begin by generating embeddings for each question using BERT [37], which effectively captures the semantic nuances of each question and represents them in a high-dimensional embedding space. By applying hierarchical clustering to these embeddings, we identify clusters that correspond to distinct traffic rule topics, such as traffic signals, speed limits, and parking regulations. To interpret and label each cluster, we use KeyBERT [31] to extract semantic keywords for each group and combine them with sample questions from the cluster. Finally, we assign descriptive types to the clusters. In DriveQA-V, we assign types manually (see supplementary for more details).
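The clustering step can be illustrated with a minimal sketch. It assumes the question embeddings have already been computed (the paper uses BERT); the toy 2-D vectors, the average-linkage criterion, the fixed cluster count, and the helper names are illustrative assumptions, not the authors' settings.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(points, n_clusters):
    """Bottom-up hierarchical clustering: merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy stand-ins for question embeddings (hypothetical values).
toy_embeddings = [
    (0.0, 0.1), (0.1, 0.0),  # e.g. two speed-limit questions
    (5.0, 5.1), (5.2, 4.9),  # e.g. two parking questions
    (9.0, 0.0),              # e.g. one traffic-signal question
]
clusters = agglomerative(toy_embeddings, n_clusters=3)
print(clusters)
```

In practice a library implementation with a dendrogram cut would replace the quadratic loop; the sketch only shows how nearby embeddings end up in the same question type.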

Fine-Tuning: Off-the-Shelf models were trained on open web data, thus having potential access to driver handbooks and tests. To further investigate the role of training data for our task, we also fine-tune models on our dataset. We find this to enhance, but not fully address, models’ ability to handle the specific complexities of traffic scenarios. We employ LoRA [33], which reduces the number of trainable parameters by introducing low-rank updates to the weight matrices in transformer layers, allowing efficient fine-tuning without requiring extensive computational resources.
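As a rough illustration of why LoRA is cheap, the sketch below forms the effective weight W + (alpha / r) · B A from a frozen matrix W and two small trainable factors, and compares trainable parameter counts. The layer size (4096 × 4096), rank 8, and scaling value are assumptions chosen for illustration; the hyperparameters used in the paper are not stated in this section.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where A is r x k and B is d x r."""
    r = len(a)                    # rank = number of rows of A
    delta = matmul(b, a)          # (d x r) @ (r x k) -> d x k
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d, k, r):
    """Parameters updated by full fine-tuning vs. a rank-r LoRA adapter."""
    return {"full": d * k, "lora": r * (d + k)}


# Tiny worked example: 2 x 2 frozen weight, rank-1 update.
w = [[1, 0], [0, 1]]
b = [[1], [2]]       # d x r
a = [[3, 4]]         # r x k
effective = lora_effective_weight(w, a, b, alpha=2)

# Hypothetical layer size and rank, for scale only.
counts = trainable_params(d=4096, k=4096, r=8)
print(effective, counts)
```

With these assumed sizes, the adapter trains 65,536 parameters for a layer that holds 16,777,216, which is the source of the efficiency mentioned above.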

CoT and RAG: We employ CoT reasoning and RAG-based context in our evaluation. CoT reasoning guides the LLMs and MLLMs through each reasoning step in a logical progression, which allows us to test their capacity for logical consistency, especially in multi-vehicle or rule-based scenarios. We also evaluate the produced reasoning, e.g., to ensure correct answers are selected for the correct reasons. For RAG, we construct a retrieval corpus derived from the official driver’s handbooks of all 50 U.S. states and DC. This corpus serves as a reliable, contextually relevant reference that provides the models with related context when answering questions. By retrieving relevant passages for each question, RAG-based context grounds the model’s responses in actual regulations, aiming to enhance both the accuracy and contextual relevance of answers.
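The retrieval step can be sketched as follows. The section does not specify the retriever, so this example swaps in a simple bag-of-words cosine similarity; the three passages are hypothetical stand-ins for handbook text, and `retrieve` is an illustrative helper, not the authors' pipeline.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(c1, c2):
    """Cosine similarity between two token-count vectors."""
    dot = sum(c1[t] * c2[t] for t in c1 if t in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)


def retrieve(question, corpus, top_k=1):
    """Return the top_k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(corpus, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:top_k]


# Hypothetical handbook passages, for illustration only.
corpus = [
    "At a four-way stop, the driver who arrives first has the right-of-way.",
    "Parking is not allowed in front of a fire hydrant.",
    "A flashing red traffic light means stop, then proceed when safe.",
]
question = "Who has the right-of-way at a four-way stop?"
context = retrieve(question, corpus, top_k=1)
print(context[0])
```

The retrieved passage is then prepended to the question so the model answers with the relevant regulation in view.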

5 Experiments

5.1 Setup

We evaluate various LLMs and MLLMs on our dataset. For each model type, we consider both open-source and closed-source variants, applying CoT and RAG strategies to structure the input prompts. Our evaluation covers not only the original capabilities of each off-the-shelf model but also a comprehensive analysis of performance after fine-tuning the open-source checkpoints on our dataset.

Prompt Structure: We designed four prompt structures to explore model performance under varying levels of reasoning and contextual support. Beginning with a basic prompt, we tested standard question-answering without additional guidance. Building on this, we introduced a CoT prompt to encourage step-by-step reasoning, aiming to enhance answer consistency in complex scenarios. To further improve contextual relevance, we combined CoT with RAG-based context by retrieving pertinent information from drivers’ handbooks, thereby grounding the responses in real-world regulations. Finally, we assessed the impact of RAG-based context alone, where we provided retrieved contextual information without step-by-step reasoning. These four prompts allowed us to examine the models’ capabilities in integrating both reasoning and factual support effectively.
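The four prompt settings amount to toggling two switches, step-by-step reasoning and retrieved context. A minimal sketch is given below; the instruction wording, the `build_prompt` helper, and the sample question are hypothetical and do not reproduce the prompts used in the paper.

```python
def build_prompt(question, options, use_cot=False, context=None):
    """Assemble a multiple-choice prompt with optional RAG context and CoT instruction."""
    lines = []
    if context:
        lines.append("Context from the driver's handbook:")
        lines.append(context)
        lines.append("")
    lines.append(f"Question: {question}")
    for label, text in zip("ABCD", options):
        lines.append(f"{label}. {text}")
    if use_cot:
        lines.append("Think step by step, then give the final answer letter.")
    else:
        lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


# Hypothetical question and retrieved passage.
question = "Who goes first at a four-way stop?"
options = [
    "The largest vehicle",
    "The driver who arrived first",
    "The driver on the left",
    "The fastest vehicle",
]
passage = "At a four-way stop, the driver who arrives first has the right-of-way."

prompts = {
    "basic": build_prompt(question, options),
    "cot": build_prompt(question, options, use_cot=True),
    "cot_rag": build_prompt(question, options, use_cot=True, context=passage),
    "rag": build_prompt(question, options, context=passage),
}
print(prompts["cot_rag"])
```

Keeping the question and option formatting identical across the four variants isolates the effect of the reasoning instruction and the retrieved context.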

Metrics: To comprehensively evaluate model performance on both the DriveQA-T and DriveQA-V datasets, we use accuracy as the primary metric, reflecting a model’s ability to correctly answer a wide range of driving-related questions across textual and visual domains. For the DriveQA-T dataset, we place an additional emphasis on reasoning capability, as each question includes an accompanying explanation. To measure how closely a model’s reasoning matches the ground-truth explanation, we employ BLEU-4 [60] and ROUGE-L [46], providing insights into the model’s ability to generate responses that are not only accurate but also demonstrate high-quality reasoning aligned with expected standards.
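For intuition, the sketch below computes multiple-choice accuracy and a ROUGE-L score based on the longest common subsequence (LCS). It assumes whitespace tokenization and the balanced F-measure (equal weight on precision and recall); the exact tokenizer and weighting used in the paper are not stated here, and the example sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed weighting)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def accuracy(predictions, answers):
    """Fraction of questions where the predicted option matches the answer key."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


# Hypothetical model explanation vs. ground-truth explanation.
score = rouge_l_f1(
    "yield to the vehicle on the right",
    "yield to the vehicle on your right",
)
acc = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
print(round(score, 4), acc)
```

Six of the seven tokens align in order, so precision and recall are both 6/7 in this toy case.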

5.2 Results

Performance of LLMs on DriveQA-T: Table 2 presents the performance of various models on our DriveQA-T dataset. Phi-3.5-mini and Gemma-2 (9B) generally perform better across most categories than other models, demonstrating their ability to comprehend driving rules. Observably, models with CoT reasoning and RAG-based context tend to achieve higher accuracy, suggesting that these enhancements contribute to performance improvements. This trend highlights the importance of advanced reasoning and contextual retrieval for complex, rule-based tasks. While certain models show promising results in accurately interpreting and following traffic regulations, consistent performance across diverse driving-related categories may still require further refinement.

As shown in Table 2, all models exhibit a significant improvement in overall accuracy after fine-tuning. However, they still struggle with numerical questions, such as those in the “Limits” and “Alcohol” categories. This difficulty suggests that models may lack the precise numerical reasoning capabilities needed to respond accurately to questions involving specific values or quantitative thresholds, which are critical in understanding speed limits, alcohol levels, and other regulatory metrics. Furthermore, for certain decision-making-focused categories, including “Passing”, “Signs”, and “Turning”, most models achieve accuracy only slightly above 80%. These categories are crucial for safe driving in practical conditions, highlighting the models’ continued shortcomings in handling nuanced, context-dependent traffic rules despite fine-tuning improvements.

CoT Reasoning of LLMs on DriveQA-T: Table 3 shows the evaluation results of CoT reasoning on the DriveQA-T dataset. Most off-the-shelf models show improvements when using RAG-based context. Specifically, GPT-4o achieves the highest BLEU-4 and ROUGE-L scores among the off-the-shelf models, reaching a BLEU-4 score of 0.3989 and a ROUGE-L score of 0.5393 with RAG-based context. After fine-tuning, Gemma-2 (9B) surpasses GPT-4o in both BLEU-4 and ROUGE-L scores, demonstrating the effectiveness of fine-tuning in adapting the model specifically to traffic rules and enabling it to provide more accurate, context-specific explanations. However, these scores still fall short of what would be considered high-quality for generating fully robust and exhaustive explanations, indicating that the models are not yet capable of consistently producing complete and nuanced responses. Furthermore, the lower BLEU-4 scores of Llama-3.2 and Phi-3.5-mini after fine-tuning (and their lower ROUGE-L scores with RAG) suggest potential issues. One possible reason for this decline is overfitting to the fine-tuning dataset, which may cause the models to become too specialized and lose some of their generalization capabilities. This overfitting can result in explanations that are overly tailored to specific training examples, reducing the models’ ability to produce flexible, broadly applicable responses. Additionally, fine-tuning may interfere with the effectiveness of RAG-based retrieval, leading to less relevant contextual information and, consequently, lower alignment with ground-truth explanations. These factors highlight the challenges of balancing specificity and generalization in fine-tuning for complex, rule-based tasks.
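The comparisons in this paragraph can be checked directly against Table 3. The short script below copies the table values and lists which off-the-shelf models gain BLEU-4 from RAG and which models lose BLEU-4 (without RAG) after fine-tuning; the variable names are illustrative.

```python
# (BLEU-4 w/o RAG, BLEU-4 w/ RAG, ROUGE-L w/o RAG, ROUGE-L w/ RAG), copied from Table 3.
off_the_shelf = {
    "Gemma-2 2B": (0.1098, 0.1704, 0.2920, 0.3387),
    "Gemma-2 9B": (0.3234, 0.3116, 0.4295, 0.4276),
    "Llama-3.1 8B": (0.2573, 0.2619, 0.3270, 0.3317),
    "Llama-3.2 3B": (0.2258, 0.3140, 0.3348, 0.4024),
    "Phi-3.5-mini 3.8B": (0.2437, 0.2574, 0.3616, 0.3996),
    "GPT-4o": (0.3905, 0.3989, 0.5354, 0.5393),
}
fine_tuned = {
    "Gemma-2 2B": (0.3623, 0.2934, 0.5058, 0.4458),
    "Gemma-2 9B": (0.4112, 0.4105, 0.5420, 0.5528),
    "Llama-3.1 8B": (0.3042, 0.2946, 0.4749, 0.4750),
    "Llama-3.2 3B": (0.2131, 0.1916, 0.3853, 0.3570),
    "Phi-3.5-mini 3.8B": (0.2362, 0.1891, 0.4073, 0.3476),
}

# Off-the-shelf models whose BLEU-4 improves when RAG context is added.
rag_helps_bleu = [m for m, s in off_the_shelf.items() if s[1] > s[0]]

# Models whose BLEU-4 (without RAG) drops after fine-tuning.
bleu_drop_after_ft = [m for m, s in fine_tuned.items() if s[0] < off_the_shelf[m][0]]

# Models whose ROUGE-L (without RAG) rises after fine-tuning.
rouge_gain_after_ft = [m for m, s in fine_tuned.items() if s[2] > off_the_shelf[m][2]]

print(rag_helps_bleu)
print(bleu_drop_after_ft)
print(rouge_gain_after_ft)
```

Five of the six off-the-shelf models gain BLEU-4 from RAG, and only Llama-3.2 and Phi-3.5-mini lose BLEU-4 after fine-tuning, while ROUGE-L without RAG rises for all five fine-tuned models.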

Performance of MLLMs on DriveQA-V: Table 4 presents the accuracy of MLLMs on DriveQA-V, which assesses model performance across intersection types and traffic sign categories. The dataset divides intersections into 4 categories based on intersection type and camera perspective, and signs into 4 categories based on most states’ driver handbooks. Among the off-the-shelf models, GPT-4o achieves the highest accuracy in all intersection and sign categories, with particularly strong performance on the sign types (around 94%). This suggests that GPT-4o possesses a deep understanding of signs. However, for intersection-based categories, the performance remains relatively low, with the highest off-the-shelf accuracy of 60.36% in the “T-Top” category. Most models except GPT-4o perform below random guess level (25%) in several categories due to bias [2]. This indicates that off-the-shelf models struggle to fully understand and apply traffic rules in intersection scenarios, which often require more complex visual-spatial reasoning. Additionally, fine-tuning significantly enhances model performance across all categories. All models achieve notable improvements after fine-tuning, which demonstrates that fine-tuning effectively adapts MLLMs to handle the visual-spatial and contextual nuances needed for an accurate understanding of both right-of-way rules and traffic signs.

Despite these gains, there remain limitations. Both LLaVA-1.5 and VILA-1.5, even after fine-tuning, achieve only moderate accuracy in intersection categories, with particularly low performance on first-person perspective images. This suggests that the models still struggle with complex, multi-vehicle intersection scenarios, where perspective and spatial relationships are critical. For the traffic sign recognition task, we observe the best training performance in the Guide Signs and Temporary Traffic Control categories. This is because guide signs typically feature simpler images with blue backgrounds, while temporary traffic control signs have distinct orange backgrounds and normally larger sign sizes, making them easier for the model to learn and generalize. However, many critical traffic signs fall under the Regulatory and Warning categories, including speed limit, no entry, etc. As shown in Table 5, among the ten worst-performing sign types, only “Trauma Center” belongs to the Guide Signs category, with the most challenging signs coming from the Regulatory and Warning categories. This highlights significant room for improvement in the current visual model. While fine-tuned models perform well on “Guide” and “Temporary Control” signs, their performance does not consistently exceed 90%. The per-category accuracies in Table 2 and Table 4 show that zero-shot performance on DriveQA-V is much lower than on DriveQA-T. This indicates that current MLLMs’ fine-grained perception and visual reasoning capabilities are nascent, exhibiting systematic shortcomings due to CLIP’s failures.

Role of Difficulty and Distractors: To further increase the evaluation difficulty, we adopt a negative sampling strategy to construct more challenging distractors. Specifically, for DriveQA-T, we construct a difficult question set containing 1249 questions. For DriveQA-V (Signs), we leverage metadata, i.e., the ground-truth traffic sign artifact categories to ensure that distractors belong to the same category as the correct answer. For numeric signs, all candidates are constrained to numerical values to further increase ambiguity. Evaluation results on GPT-4o and a representative open-source baseline are summarized in Table 6.
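A minimal sketch of same-category distractor sampling for sign questions is shown below. The sign names, the category mapping, and the `sample_hard_distractors` helper are illustrative assumptions; the actual metadata schema and sampling procedure are not described in this section.

```python
import random


def sample_hard_distractors(answer, sign_categories, n=3, seed=0):
    """Pick n wrong options from the same sign category as the correct answer."""
    category = sign_categories[answer]
    pool = [s for s, c in sign_categories.items() if c == category and s != answer]
    if len(pool) < n:
        raise ValueError(f"not enough signs in category {category!r}")
    rng = random.Random(seed)  # fixed seed keeps the question set reproducible
    return rng.sample(pool, n)


# Hypothetical metadata: sign name -> category.
sign_categories = {
    "Stop": "Regulatory",
    "Yield": "Regulatory",
    "No Entry": "Regulatory",
    "Do Not Pass": "Regulatory",
    "Speed Limit": "Regulatory",
    "Slippery Road": "Warning",
    "Deer Crossing": "Warning",
    "Trauma Center": "Guide",
}

distractors = sample_hard_distractors("No Entry", sign_categories, n=3)
options = distractors + ["No Entry"]
random.Random(1).shuffle(options)  # avoid always placing the answer last
print(options)
```

Restricting the pool to the answer's own category removes easy eliminations such as ruling out a warning sign when the image clearly shows a regulatory one.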

Sim-to-Real Transferability: We evaluate our models fine-tuned on DriveQA on a dataset that we curated from Mapillary [57] (1303 annotated images, including 166 sign types), as shown in Table 7. Additionally, Table 8 reports results on the downstream trajectory planning task with OpenEMMA [87] on the nuScenes dataset, where our task-agnostic QA model is intentionally fine-tuned only on DriveQA but tested zero-shot on waypoint prediction to measure generalization. Reduced L2 errors show the transferability of our dataset. However, nuScenes lacks diversity and is generally uneventful (e.g., minimal signage), while our benchmark exhaustively covers all traffic rules and scenarios. We therefore also evaluate on the more diverse BDD-OIA dataset [88], as shown in Table 9. After fine-tuning on DriveQA, the models achieve better performance in cross-domain real-world driving tasks, demonstrating the effectiveness of our data in improving the understanding of traffic rules and real-world generalizability. We provide additional analysis in the supplementary.
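For readers unfamiliar with the metric, an L2 error for waypoint prediction can be computed as the mean Euclidean distance between predicted and ground-truth waypoints. The sketch below is a generic formulation with made-up coordinates; the prediction horizon and averaging protocol used with OpenEMMA on nuScenes are not given in this section and may differ.

```python
import math


def mean_l2_error(predicted, ground_truth):
    """Average Euclidean distance between matched predicted and true waypoints."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("waypoint lists must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical (x, y) waypoints in metres.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
error = mean_l2_error(predicted, ground_truth)
print(error)
```

A lower value means the predicted trajectory stays closer to the recorded one, which is the sense in which reduced L2 error indicates better transfer.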

Limitation: While our benchmark, models, and analysis provide insights into the performance of models in understanding diverse traffic rules for autonomous driving, there are several limitations, which we plan to address in future work. First, the benchmark primarily evaluates static, structured knowledge of traffic rules. While this is aligned with standard driving knowledge tasks, there is an opportunity to leverage video-based models in the future (e.g., using our augmented CARLA simulation). Our analysis demonstrates that incorporating knowledge from text does indeed transfer to dynamic settings in nuScenes, yet vision-based reasoning remains nascent in MLLMs (or even spatial reasoning [2]). Moreover, our study highlights weaknesses in numerical reasoning and spatial awareness yet does not explore potential mitigation strategies beyond fine-tuning. The reliance on synthetic data also raises concerns about domain adaptation. Nonetheless, simulation data is crucial for scalability, as we are able to control for many factors, including occlusions and ambiguous signage, which may be rare in real-world benchmarks. Finally, while the dataset includes controlled variations in environmental factors like lighting and weather, it does not extensively cover edge cases such as emergency vehicle interactions (only covered in DriveQA-T) or pedestrian intent recognition. The models also exhibit biases towards frequently seen traffic patterns, which may result in poor generalization to geographically diverse driving environments with different road layouts and regulations.

6 Conclusion

In this paper, we introduce DriveQA, a novel benchmark for autonomous driving that evaluates models through text-based (DriveQA-T) and visual-text (DriveQA-V) question-answering, focusing on general traffic rules, traffic signs, and complex right-of-way scenarios. Our evaluation of state-of-the-art models reveals critical limitations: even fine-tuned models struggle with nuanced right-of-way scenarios, falling short of the reasoning needed for safe driving guidance. Our work deliberately focuses on static visual and textual inputs, i.e., to align with real-world driver knowledge tests. While video-based learning is not required to adhere to these standards, future research could explore hybrid frameworks incorporating video to address time-dependent scenarios. Ultimately, while humans can learn traffic rules through textual instruction and contextual practice, current models remain overly reliant on observational training data. Models thus lack the ability to internalize explicit textual knowledge and apply it effectively in decision-making. This suggests that learning traffic rules from text remains an underexplored paradigm, highlighting the need for methods that better integrate language understanding with spatial reasoning.

Acknowledgments: We thank the National Science Foundation (award IIS-2152077) and Red Hat Collaboratory (award #2024-01-RH02) for supporting this research.

Bibliography

[1] Carla autonomous driving leaderboard. https://leaderboard.carla.org/, 2022.
[2] Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024.
[3] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv:2404.14219, 2024.
[4] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv:2303.08774, 2023.
[5] Hidehisa Arai, Keita Miwa, Kento Sasaki, Yu Yamaguchi, Kohei Watanabe, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. arXiv:2408.10845, 2024.
[6] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[8] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In CVPR, 2019.