Contrastive Video Question Answering via Video Graph Transformer

Junbin Xiao; Pan Zhou; Angela Yao; Yicong Li; Richang Hong; Shuicheng Yan; Tat-Seng Chua

arXiv:2302.13668 · cs.CV · July 12, 2023 · 1 citation

TL;DR

This paper introduces CoVGT, a novel contrastive video question answering model that uses a video graph transformer to improve fine-grained reasoning and outperforms previous methods with less data.

Contribution

It presents a dynamic graph transformer for detailed video encoding and contrastive learning for video-text matching, advancing VideoQA performance and data efficiency.

Findings

01 CoVGT surpasses previous models on video reasoning tasks.

02 It achieves high performance with significantly less data.

03 The model benefits from cross-modal pretraining.

Abstract

We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of multi-modal transformer for answer classification. Fine-grained video-text communication is done by additional cross-modal interaction modules. 3) It is optimized by the joint fully- and self-supervised contrastive objectives between the correct and incorrect answers, as well as the relevant and irrelevant questions respectively. With superior video encoding and QA solution, we show that CoVGT can achieve…
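
The contrastive formulation described in the abstract can be illustrated with a toy example. What follows is a minimal sketch, not the authors' implementation: it assumes embeddings are already computed (the vectors are made-up numbers), uses a plain dot product as the similarity, and sums the two objectives with equal weight. The function names `contrastive_loss` and `predict` are hypothetical, and the abstract does not state how the supervised and self-supervised terms are combined.

```python
import math

def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_loss(query, candidates, positive_index):
    """Cross-entropy over similarities: pull the positive candidate
    toward the query, push the other candidates away."""
    scores = [dot(query, c) for c in candidates]
    return -math.log(softmax(scores)[positive_index])

def predict(query, candidates):
    """Answer selection: pick the candidate most similar to the query."""
    scores = [dot(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings (assumed values, standing in for encoder outputs).
video_question = [1.0, 0.0]                       # video conditioned on the question
answers = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]   # index 0 is the correct answer

video = [0.5, 0.5]                                # video representation alone
questions = [[1.0, 1.0], [-1.0, 0.5]]             # index 0 is the relevant question

# Supervised term: correct vs. incorrect answers.
answer_loss = contrastive_loss(video_question, answers, positive_index=0)
# Self-supervised term: relevant vs. irrelevant questions.
question_loss = contrastive_loss(video, questions, positive_index=0)
# Equal-weight sum is an assumption made for this sketch.
joint_loss = answer_loss + question_loss

predicted = predict(video_question, answers)
```

Framing QA as similarity ranking rather than classification over a fixed answer vocabulary means the same scoring function applies to any set of candidate answers, which is the contrast the abstract draws with multi-modal transformers used for answer classification.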

Code & Models

Repositories

doc-doc/covgt (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Laplacian EigenMap · Absolute Position Encodings · Label Smoothing · Softmax · Adam · Layer Normalization · Residual Connection