Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns

Kaavya Rekanar; Martin Hayes; Ganesh Sistu; Ciaran Eising

arXiv:2406.09203 · cs.CV · June 14, 2024 · 3 cites

Open Access

TL;DR

This paper examines how visual question answering models for autonomous driving can be improved by aligning their attention patterns more closely with human focus, leading to better accuracy and trust.

Contribution

It introduces a filter-based method to optimize VQA model attention mechanisms, bridging the gap between human and machine attention patterns in driving scenarios.

Findings

01 Enhanced model accuracy with filter integration

02 Attention patterns more aligned with human focus

03 Improved feature prioritization in VQA models

Abstract

Visual Question Answering (VQA) models play a critical role in enhancing the perception capabilities of autonomous driving systems by allowing vehicles to analyze visual inputs alongside textual queries, fostering natural interaction and trust between the vehicle and its occupants or other road users. This study investigates the attention patterns of humans compared to a VQA model when answering driving-related questions, revealing disparities in the objects observed. We propose an approach integrating filters to optimize the model's attention mechanisms, prioritizing relevant objects and improving accuracy. Utilizing the LXMERT model for a case study, we compare attention patterns of the pre-trained and Filter Integrated models, alongside human answers using images from the NuImages dataset, gaining insights into feature prioritization. We evaluated the models using a Subjective…
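
The filter design is not specified in this summary, so the following is only an illustrative sketch of the general idea of prioritizing relevant objects. It assumes the filter acts on per-region attention weights by masking regions whose detected category is not driving-relevant and then renormalizing. The category set `RELEVANT`, the function `filter_attention`, and the example labels and weights are all hypothetical and are not taken from the paper or from LXMERT.

```python
# Hypothetical set of driving-relevant categories (an assumption, not from the paper).
RELEVANT = {"car", "pedestrian", "traffic_light", "bicycle"}


def filter_attention(labels, weights, relevant=RELEVANT):
    """Zero the attention on non-relevant regions, then renormalize to sum to 1."""
    if len(labels) != len(weights):
        raise ValueError("labels and weights must have the same length")
    # Keep the weight only where the region's label is in the relevant set.
    masked = [w if lab in relevant else 0.0 for lab, w in zip(labels, weights)]
    total = sum(masked)
    if total == 0.0:
        # No relevant region detected: fall back to the unfiltered weights.
        return list(weights)
    return [w / total for w in masked]


# Made-up example: four detected regions with their attention weights.
labels = ["car", "tree", "pedestrian", "building"]
weights = [0.2, 0.4, 0.1, 0.3]
filtered = filter_attention(labels, weights)
```

In this made-up example, the attention originally placed on the tree and the building is redistributed to the car and the pedestrian in proportion to their original weights. A real system could instead filter the region proposals before they reach the model; which of the two the authors do is not stated here.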

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Speech and dialogue systems

Methods: Learning Cross-Modality Encoder Representations from Transformers (LXMERT)