Self-supervised pre-training and contrastive representation learning for multiple-choice video QA

Seonhoon Kim; Seohyeong Jeong; Eunbyul Kim; Inho Kang; Nojun Kwak

arXiv:2009.08043·cs.CL·December 15, 2020·5 cites

TL;DR

This paper introduces a training framework for multiple-choice video QA that combines a self-supervised pre-training stage, which gives the model broader contextual inputs, with contrastive learning in the main training stage, and uses locally aligned attention to focus on relevant frames.

Contribution

The paper proposes a training scheme for video QA that combines self-supervised pre-training with supervised contrastive learning and requires no additional datasets or annotations.

Findings

01. Achieved state-of-the-art results on the TVQA, TVQA+, and DramaQA datasets.

02. Demonstrated the effectiveness of contrastive learning with masking noise.

03. Validated the benefit of locally aligned attention for focusing on relevant frames.

Abstract

Video Question Answering (Video QA) requires fine-grained understanding of both video and language modalities to answer the given questions. In this paper, we propose novel training schemes for multiple-choice video question answering with a self-supervised pre-training stage and supervised contrastive learning as an auxiliary objective in the main stage. In the self-supervised pre-training stage, we transform the original problem format of predicting the correct answer into one of predicting the relevant question, which provides the model with broader contextual inputs without any further dataset or annotation. For contrastive learning in the main stage, we add masking noise to the input corresponding to the ground-truth answer, and consider the original input of the ground-truth answer as a positive sample, while treating the rest as negative samples. By mapping the positive sample…
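The contrastive setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated before it states the loss, so the InfoNCE-style objective, the cosine similarity, the temperature value, the element-wise zeroing used as "masking noise", and the names `mask_noise`, `cosine`, and `contrastive_loss` are all assumptions made for illustration. The sketch treats the masked ground-truth input as the anchor, the unmasked ground-truth input as the positive, and the remaining answer candidates as negatives.

```python
import math
import random


def mask_noise(vec, drop_prob, rng):
    """Zero each element with probability drop_prob (assumed form of masking noise)."""
    return [0.0 if rng.random() < drop_prob else x for x in vec]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumption, not confirmed by the source).

    The loss is low when the anchor is closer to the positive than to
    any negative, and high otherwise.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy vectors standing in for encoded (video, question, answer) inputs.
rng = random.Random(0)
ground_truth = [0.9, 0.1, 0.4, 0.7]          # positive: original ground-truth input
wrong_answers = [[0.1, 0.8, 0.3, 0.0],       # negatives: the other candidates
                 [0.2, 0.2, 0.9, 0.1]]
anchor = mask_noise(ground_truth, 0.25, rng)  # anchor: masked ground-truth input

print("contrastive loss:", contrastive_loss(anchor, ground_truth, wrong_answers))
```

In a real model the vectors would come from a learned encoder and the loss would be added to the answer-classification loss as an auxiliary term; here plain lists are used so the sketch runs with the standard library alone.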

Videos

Self-Supervised Pre-Training and Contrastive Representation Learning for Multiple-Choice Video QA

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Learning