Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering
Jiong Wang, Zhou Zhao, Weike Jin

TL;DR
This paper introduces a weakly supervised question grounding method for multi-modal video question answering that leverages frame-subtitle self-supervision to improve both QA accuracy and temporal localization without requiring costly annotations.
Contribution
It proposes a novel weakly supervised question grounding framework using frame-subtitle self-supervision, reducing reliance on expensive temporal annotations.
Findings
Achieves question grounding performance comparable to fully supervised methods.
Improves QA and grounding accuracy with FS self-supervision on the TVQA and TVQA+ datasets.
Effective in both QA-supervision only and full-supervision settings.
Abstract
Multi-modal video question answering aims to predict the correct answer and localize the temporal boundary relevant to the question. The temporal annotations of questions improve the QA performance and interpretability of recent works, but they are usually empirical and costly. To avoid temporal annotations, we devise a weakly supervised question grounding (WSQG) setting, where only QA annotations are used and the relevant temporal boundaries are generated according to the temporal attention scores. To substitute for the temporal annotations, we transform the correspondence between frames and subtitles into Frame-Subtitle (FS) self-supervision, which helps to optimize the temporal attention scores and hence improves video-language understanding in the VideoQA model. Extensive experiments on the TVQA and TVQA+ datasets demonstrate that the proposed WSQG strategy gets comparable performance on…
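As a rough illustration of how a temporal boundary might be read off temporal attention scores in a WSQG-style setting, the sketch below grows a span outward from the highest-scoring frame while neighboring scores stay above a fraction of the peak. This is not the authors' procedure, which the abstract does not spell out; the function names, the peak-ratio rule, and the default ratio of 0.5 are all assumptions made for illustration. Scores are assumed to be non-negative (for example, softmax outputs).

```python
from typing import Sequence, Tuple


def span_from_attention(scores: Sequence[float], ratio: float = 0.5) -> Tuple[int, int]:
    """Return (start, end) frame indices, inclusive, around the attention peak.

    Hypothetical rule (an assumption, not taken from the paper): keep the
    contiguous run of frames whose score is at least `ratio` times the peak.
    Assumes non-negative scores.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    # Index of the frame with the highest attention score.
    peak = max(range(len(scores)), key=lambda i: scores[i])
    cutoff = ratio * scores[peak]
    # Grow the span to the left while neighbors stay above the cutoff.
    start = peak
    while start > 0 and scores[start - 1] >= cutoff:
        start -= 1
    # Grow the span to the right in the same way.
    end = peak
    while end < len(scores) - 1 and scores[end + 1] >= cutoff:
        end += 1
    return start, end


def temporal_iou(pred: Tuple[int, int], gold: Tuple[int, int]) -> float:
    """Intersection over union of two inclusive frame-index spans."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gold[1] - gold[0] + 1) - inter
    return inter / union


if __name__ == "__main__":
    attention = [0.02, 0.05, 0.30, 0.40, 0.15, 0.08]  # made-up scores
    predicted = span_from_attention(attention)
    print(predicted)
    print(temporal_iou(predicted, (2, 4)))  # (2, 4) is a made-up reference span
```

A span produced this way can be compared against an annotated boundary with temporal IoU at evaluation time only, which matches the weakly supervised premise that no temporal annotation is used during training.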
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
