Large Language Models are Temporal and Causal Reasoners for Video Question Answering

Dohwan Ko; Ji Soo Lee; Wooyoung Kang; Byungseok Roh; Hyunwoo J. Kim

arXiv:2310.15747 · cs.CV · November 7, 2023 · 2 cites

TL;DR

This paper introduces Flipped-VQA, a novel framework that enhances Video Question Answering by encouraging models to understand complex relationships among video, question, and answer, reducing linguistic bias and improving performance across multiple benchmarks.

Contribution

The paper proposes Flipped-VQA, a general framework for LLM-based VideoQA that trains the model to predict every flipped combination of the ⟨V, Q, A⟩ triplet, leveraging the LLM's linguistic priors while mitigating linguistic bias.

Findings

01 LLaMA-VQA outperforms existing models on five VideoQA benchmarks.

02 Flipped-VQA improves performance across various LLMs like OPT and GPT-J.

03 The framework reduces linguistic bias and enhances understanding of complex video-question-answer relationships.

Abstract

Large Language Models (LLMs) have shown remarkable performances on a wide range of natural language understanding and generation tasks. We observe that the LLMs provide effective priors in exploiting linguistic shortcuts for temporal and causal reasoning in Video Question Answering (VideoQA). However, such priors often cause suboptimal results on VideoQA by leading the model to over-rely on questions, i.e., linguistic bias, while ignoring visual content. This is also known as 'ungrounded guesses' or 'hallucinations'. To address this problem while leveraging LLMs' prior on VideoQA, we propose a novel framework, Flipped-VQA, encouraging the model to predict all the combinations of ⟨V, Q, A⟩ triplet by flipping the source pair and the target label to understand their complex relationships, i.e., predict A, Q, and V given a VQ, VA,…
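
The flipping idea in the abstract can be illustrated with a minimal sketch that enumerates one prediction task per triplet element. This is not the authors' implementation: the function name `flipped_tasks`, the dictionary layout, and the example strings are hypothetical, and the third source pair (Q and A, used to predict V) is an assumption inferred from "all the combinations", since the abstract text above is cut off.

```python
def flipped_tasks(triplet):
    """Build one (source pair -> target) task per element of a <V, Q, A> triplet.

    Hypothetical helper for illustration only; it does not reproduce the
    training objective or model code of the Flipped-VQA repository.
    """
    keys = ("V", "Q", "A")
    tasks = []
    # Iterate targets in the order A, Q, V, mirroring the abstract's wording.
    for target in reversed(keys):
        # The source is the pair formed by the two remaining elements.
        source = {k: triplet[k] for k in keys if k != target}
        tasks.append(
            {"source": source, "target_name": target, "target": triplet[target]}
        )
    return tasks


# Placeholder values; a real pipeline would use video features and token ids.
example = {
    "V": "<video features>",
    "Q": "<question text>",
    "A": "<answer text>",
}

for task in flipped_tasks(example):
    print("+".join(task["source"]), "->", task["target_name"])
```

Run as written, the loop lists the three source-to-target directions (V+Q to A, V+A to Q, Q+A to V). In an actual model each direction would contribute its own loss term; how those terms are defined and weighted is not stated in the text above and is left out of this sketch.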

Code & Models

Repositories

mlvlab/Flipped-VQA
PyTorch · Official

Datasets

ikodoh/Flipped-VQA-Data
dataset · 104 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning