Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios
Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, Alexandre Alahi

TL;DR
This paper introduces RIV-CoT, a retrieval-based visual reasoning method that enhances large vision-language models' ability to perform complex reasoning in real-world driving scenarios, demonstrated on new and existing datasets.
Contribution
It presents RIV-CoT, a novel retrieval-based interleaved visual chain-of-thought approach, and introduces the DrivingVQA dataset for real-world visual reasoning evaluation.
Findings
RIV-CoT improves answer accuracy by 3.1% over vanilla CoT.
RIV-CoT improves reasoning accuracy by 4.6% over vanilla CoT.
The method scales effectively to larger datasets using pseudo-labels.
Abstract
While chain-of-thought (CoT) prompting improves reasoning in large language models, its effectiveness in vision-language models (VLMs) remains limited due to over-reliance on textual cues and memorized knowledge. To investigate the visual reasoning capabilities of VLMs in complex real-world scenarios, we introduce DrivingVQA, a visual question answering dataset derived from driving theory exams, which contains 3,931 multiple-choice problems with expert-written explanations and grounded entities relevant to the reasoning process. Leveraging this dataset, we propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables VLMs to reason using visual crops corresponding to these relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting. Furthermore, we demonstrate that our…
Taxonomy
Topics: Categorization, perception, and language · Language, Metaphor, and Cognition
