Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering

Jiong Wang; Zhou Zhao; Weike Jin

arXiv:2209.03609·cs.CV·September 9, 2022·1 cites

TL;DR

This paper introduces a weakly supervised question grounding method for multi-modal video question answering that leverages frame-subtitle self-supervision to improve both QA accuracy and temporal localization without requiring costly annotations.

Contribution

It proposes a novel weakly supervised question grounding framework using frame-subtitle self-supervision, reducing reliance on expensive temporal annotations.

Findings

1. Achieves comparable question grounding performance to fully supervised methods.

2. Improves QA and grounding accuracy with FS self-supervision on TVQA datasets.

3. Effective in both QA-supervision only and full-supervision settings.

Abstract

Multi-modal video question answering aims to predict the correct answer and localize the temporal boundary relevant to the question. Temporal annotations of questions improve the QA performance and interpretability of recent works, but they are usually empirical and costly to obtain. To avoid temporal annotations, we devise a weakly supervised question grounding (WSQG) setting, where only QA annotations are used and the relevant temporal boundaries are generated according to the temporal attention scores. To substitute for the temporal annotations, we transform the correspondence between frames and subtitles into Frame-Subtitle (FS) self-supervision, which helps to optimize the temporal attention scores and hence improves video-language understanding in the VideoQA model. Extensive experiments on the TVQA and TVQA+ datasets demonstrate that the proposed WSQG strategy achieves comparable performance on…
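The abstract says that, in the WSQG setting, temporal boundaries are generated from the temporal attention scores, but it does not spell out the procedure. The sketch below shows one plausible way to do it, assuming per-frame, non-negative attention scores and a simple thresholding rule around the peak. The function names `ground_from_attention` and `temporal_iou` and the `ratio` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def ground_from_attention(scores: List[float], ratio: float = 0.5) -> Tuple[int, int]:
    """Turn per-frame attention scores into an inclusive (start, end) span.

    Assumption (not from the paper): the span is the contiguous run of
    frames around the highest-scoring frame whose scores stay at or above
    ratio * max(scores). Scores are assumed to be non-negative.
    """
    if not scores:
        raise ValueError("scores must be non-empty")

    # Index of the frame with the largest attention score.
    peak = max(range(len(scores)), key=scores.__getitem__)
    threshold = ratio * scores[peak]

    # Grow the span to the left while neighbours stay above the threshold.
    start = peak
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1

    # Grow the span to the right in the same way.
    end = peak
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1

    return start, end


def temporal_iou(pred: Tuple[int, int], gold: Tuple[int, int]) -> float:
    """Intersection over union of two inclusive frame spans."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gold[1] - gold[0] + 1) - inter
    return inter / union


if __name__ == "__main__":
    attention = [0.02, 0.05, 0.30, 0.40, 0.15, 0.08]
    span = ground_from_attention(attention, ratio=0.5)
    print(span, temporal_iou(span, (2, 4)))
```

Under this assumed rule, sharper attention distributions yield tighter spans, which is consistent with the abstract's claim that optimizing the attention scores also improves grounding. The actual boundary-generation rule used by the authors may differ.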

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition