Self-supervised pre-training and contrastive representation learning for multiple-choice video QA
Seonhoon Kim, Seohyeong Jeong, Eunbyul Kim, Inho Kang, Nojun Kwak

TL;DR
This paper introduces a training framework for multiple-choice video QA that pairs self-supervised pre-training with supervised contrastive learning. Pre-training on question prediction gives the model broader contextual inputs, and locally aligned attention focuses it on the relevant frames.
Contribution
The paper proposes a training scheme that combines a self-supervised pre-training stage with supervised contrastive learning as an auxiliary objective in the main stage, without requiring any further dataset or annotation.
Findings
Achieved state-of-the-art results on TVQA, TVQA+, and DramaQA datasets.
Demonstrated effectiveness of contrastive learning with masking noise.
Validated the benefit of locally aligned attention for relevant frame focus.
Abstract
Video Question Answering (Video QA) requires fine-grained understanding of both video and language modalities to answer the given questions. In this paper, we propose novel training schemes for multiple-choice video question answering with a self-supervised pre-training stage and a supervised contrastive learning in the main stage as an auxiliary learning. In the self-supervised pre-training stage, we transform the original problem format of predicting the correct answer into the one that predicts the relevant question to provide a model with broader contextual inputs without any further dataset or annotation. For contrastive learning in the main stage, we add a masking noise to the input corresponding to the ground-truth answer, and consider the original input of the ground-truth answer as a positive sample, while treating the rest as negative samples. By mapping the positive sample…
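The contrastive setup described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the InfoNCE-style loss, cosine similarity, the temperature value, the element-wise zeroing used as masking noise, and the names `mask_noise`, `cosine`, and `contrastive_loss` are all assumptions made for the example. The toy vectors stand in for encoded answer-candidate representations.

```python
import math
import random


def mask_noise(vec, drop_prob, rng):
    """Zero each element independently with probability drop_prob (assumed noise form)."""
    return [0.0 if rng.random() < drop_prob else x for x in vec]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def contrastive_loss(anchor, candidates, positive_idx, temperature=0.1):
    """InfoNCE-style loss (an assumption): the candidate at positive_idx is the
    positive sample, every other candidate is a negative sample."""
    logits = [cosine(anchor, c) / temperature for c in candidates]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[positive_idx]


# Toy representations of five answer candidates (hypothetical values).
candidates = [
    [1.0, 0.0, 0.5, 0.2],
    [0.1, 0.9, 0.3, 0.7],
    [0.4, 0.4, 0.9, 0.1],
    [0.8, 0.2, 0.1, 0.6],
    [0.3, 0.7, 0.2, 0.9],
]
gt = 1  # index of the ground-truth answer

rng = random.Random(0)
# Anchor: the ground-truth input with masking noise applied.
anchor = mask_noise(candidates[gt], drop_prob=0.25, rng=rng)
loss = contrastive_loss(anchor, candidates, positive_idx=gt)
print(f"contrastive loss: {loss:.4f}")
```

Minimizing this loss pulls the masked anchor toward the original ground-truth representation and pushes it away from the other candidates, which is the behavior the abstract describes. In a real model the vectors would come from the video-language encoder rather than fixed lists.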
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Learning
