Can I Trust Your Answer? Visually Grounded Video Question Answering

Junbin Xiao; Angela Yao; Yicong Li; Tat Seng Chua

arXiv:2309.01327 · cs.CV · April 2, 2024 · 1 citation

TL;DR

This paper evaluates the grounding capabilities of vision-language models in video question answering, revealing their weaknesses and proposing a new method to improve the reliability of their visual explanations.

Contribution

It introduces NExT-GQA, a dataset with temporal grounding labels, and proposes a grounded-QA method that enhances model grounding and answer accuracy.

Findings

01. Current VLMs perform well in QA but poorly in visual grounding.

02. Grounded-QA improves both answer accuracy and grounding quality.

03. Models are weak in substantiating answers despite high QA performance.

Abstract

We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding. Specifically, by forcing vision-language models (VLMs) to answer questions and simultaneously provide visual evidence, we seek to ascertain the extent to which the predictions of such techniques are genuinely anchored in relevant video content, versus spurious correlations from language or irrelevant visual context. Towards this, we construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding (or location) labels tied to the original QA pairs. With NExT-GQA, we scrutinize a series of state-of-the-art VLMs. Through post-hoc attention analysis, we find that these models are extremely weak in substantiating the answers despite their strong QA performance. This exposes the limitation of current VLMs in making reliable predictions. As a…
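To make "answer the question and simultaneously provide visual evidence" concrete, the sketch below scores a predicted time span against a labeled one using temporal intersection-over-union, then counts a prediction as grounded only when the answer is right and the overlap clears a threshold. This is an illustrative assumption: the summary above does not say which grounding metric the paper uses, and the names `temporal_iou`, `grounded_correct`, and the 0.5 threshold are hypothetical.

```python
from typing import Tuple

# A temporal segment as (start_sec, end_sec). Hypothetical representation,
# not taken from the NExT-GQA annotation format.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    # Overlap is clamped at zero for disjoint intervals.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(answer_ok: bool, pred: Segment, gt: Segment,
                     threshold: float = 0.5) -> bool:
    """True only if the answer is correct AND the predicted segment
    overlaps the labeled segment by at least `threshold` (assumed value)."""
    return answer_ok and temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    # Correct answer, but evidence only partly overlaps the labeled span.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
    print(grounded_correct(True, (2.0, 6.0), (4.0, 8.0)))
```

Under a criterion like this, a model can score well on plain QA accuracy while scoring poorly on grounded accuracy, which is the gap the findings describe.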

Code & Models

Repositories

doc-doc/next-gqa
PyTorch · Official

Datasets

jinyoungkim/NExT-GQA
dataset · 911 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning