Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving
Kaavya Rekanar, Ciarán Eising, Ganesh Sistu, Martin Hayes

TL;DR
This paper provides an initial performance analysis of three pre-trained VQA models in autonomous driving scenarios, focusing on their response similarity to expert answers and the impact of multimodal architecture features.
Contribution
It introduces a preliminary evaluation of ViLBERT, ViLT, and LXMERT models for driving-related VQA tasks, highlighting the influence of cross-modal attention and fusion techniques.
Findings
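Models that incorporate cross-modal attention show promise for producing better answers to driving-related questions.
Late fusion techniques likewise show promising results.
The analysis sets the stage for a forthcoming comparative study of nine VQA models.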
Abstract
This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions relating to driving scenarios. The performance of these models is evaluated by comparing the similarity of responses to reference answers provided by computer vision experts. Model selection is predicated on the analysis of transformer utilization in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion techniques exhibit promising potential for generating improved answers within a driving perspective. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the scene for further investigations into the effectiveness of VQA model queries in self-driving scenarios. Supplementary…
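The abstract describes scoring each model by how similar its response is to reference answers from experts, but this summary does not say which similarity measure the authors used. The sketch below is only an illustration of that general idea, assuming a simple bag-of-words cosine similarity; the function names (`cosine_similarity`, `score_response`, `rank_models`), the model labels, and the example answers are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the bag-of-words count vectors of two strings."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def score_response(response, references):
    """Best similarity between one model response and any expert reference."""
    return max((cosine_similarity(response, ref) for ref in references), default=0.0)


def rank_models(responses, references):
    """Return (model, score) pairs sorted from most to least similar."""
    scores = {name: score_response(ans, references) for name, ans in responses.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up example data (assumption): one question, two expert references,
# and one answer from each of three placeholder models.
references = ["the traffic light is red", "red"]
responses = {
    "model_a": "red",
    "model_b": "the light is green",
    "model_c": "a pedestrian",
}
ranking = rank_models(responses, references)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Taking the maximum over references means a response is credited for matching any one expert phrasing. Note that a purely lexical measure like this one rewards word overlap rather than correctness: the hypothetical "the light is green" still scores well against "the traffic light is red", which is one reason an actual study might prefer a semantic similarity measure.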
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Vision-and-Language BERT · Learning Cross-Modality Encoder Representations from Transformers
