Video Question Answering Using CLIP-Guided Visual-Text Attention

Shuhong Ye; Weikai Kong; Chenglin Yao; Jianfeng Ren; Xudong Jiang

arXiv:2303.03131 · cs.CV · March 9, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a CLIP-guided visual-text attention mechanism for VideoQA, leveraging cross-domain learning to improve answer prediction by integrating domain-specific and general knowledge features.

Contribution

It proposes a novel CLIP-guided cross-domain learning approach for VideoQA that enhances cross-modal attention and improves performance on benchmark datasets.

Findings

01 Outperforms state-of-the-art methods on the MSVD-QA and MSRVTT-QA datasets

02 Effective integration of general-domain and target-domain features improves accuracy

03 Demonstrates the benefit of CLIP-guided attention in VideoQA tasks

Abstract

Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that uses Contrastive Language-Image Pre-training (CLIP), trained on a large number of general-domain language-image pairs, to guide cross-modal learning for VideoQA. Specifically, we first extract video features using a TimeSformer and text features using BERT from the target application domain, and use CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning. We then propose Cross-domain Learning to extract the attention information between visual and linguistic features across the target domain and the general domain. The set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets, and…
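The abstract does not spell out the exact form of the visual-text attention, so the following is only a minimal sketch of one plausible reading, not the authors' architecture. It assumes that target-domain video features attend over general-domain CLIP text features, that target-domain text features attend over CLIP visual features, and that the two attended sets are mean-pooled and concatenated before answer prediction. All function names (`cross_attention`, `clip_guided_fusion`) and the toy feature values are hypothetical; real features would come from TimeSformer, BERT, and CLIP encoders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows."""
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value rows, computed column by column.
        attended.append(
            [sum(w * v for w, v in zip(weights, col)) for col in zip(*values)]
        )
    return attended


def mean_pool(rows):
    """Average a list of feature rows into a single feature vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def clip_guided_fusion(video_feats, text_feats, clip_visual, clip_text):
    """Hypothetical fusion step (an assumption, not the paper's exact design)."""
    # Target-domain video frames query the general-domain CLIP text features.
    video_guided = cross_attention(video_feats, clip_text, clip_text)
    # Target-domain text tokens query the general-domain CLIP visual features.
    text_guided = cross_attention(text_feats, clip_visual, clip_visual)
    # Pool each attended set and concatenate into one joint representation.
    return mean_pool(video_guided) + mean_pool(text_guided)


# Toy stand-ins for encoder outputs (made-up numbers, feature dimension 4).
video_feats = [[0.2, 0.1, 0.4, 0.3], [0.5, 0.0, 0.1, 0.2]]  # 2 frames
text_feats = [[0.1, 0.3, 0.2, 0.0], [0.0, 0.2, 0.1, 0.4], [0.3, 0.1, 0.0, 0.2]]  # 3 tokens
clip_visual = [[0.6, 0.1, 0.2, 0.1], [0.2, 0.4, 0.1, 0.3]]
clip_text = [[0.1, 0.5, 0.3, 0.1], [0.3, 0.2, 0.2, 0.3]]

fused = clip_guided_fusion(video_feats, text_feats, clip_visual, clip_text)
print(len(fused))  # 8: two pooled 4-dimensional vectors, concatenated
```

In a full system the fused vector would feed an answer classifier, and the attention would typically be multi-head with learned projections; the single-head, projection-free form here is chosen only to keep the idea visible.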

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Adam · Dropout · Softmax · TimeSformer · Dense Connections