LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling

Dongsheng Chen; Chaofan Tao; Lu Hou; Lifeng Shang; Xin Jiang; Qun Liu

arXiv:2210.11929·cs.CV·October 24, 2022·1 cites

TL;DR

LiteVL is a computationally efficient video-language model that adapts the pre-trained image-language model BLIP to video tasks by adding temporal attention and text-conditioned pooling, outperforming previous models on text-video retrieval and video question answering without heavy video-language pre-training.

Contribution

The paper introduces a method for adapting an image-language model directly on downstream video tasks, removing the need for heavy video-language pre-training.

Findings

1. Outperforms previous models on text-video retrieval

2. Achieves better results on video question answering

3. Operates without heavy video-language pre-training

Abstract

Recent large-scale video-language pre-trained models have shown appealing performance on various downstream tasks. However, the pre-training process is computationally expensive due to the requirement of millions of video-text pairs and the redundant data structure of each video. To mitigate these problems, we propose LiteVL, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks, without heavy pre-training. To enhance the temporal modeling lacking in the image-language model, we propose to add temporal attention modules in the image encoder of BLIP with dynamic temporal scaling. Besides the model-wise adaptation, we also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text. Experimental results on text-video retrieval and video question answering show that the…
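The text-conditioned pooling idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the reweighting is a softmax over cosine similarities between each frame embedding and the text embedding, which is one common parameter-free choice. The function name `text_conditioned_pool` and the `temperature` argument are hypothetical and introduced only for illustration.

```python
import math


def text_conditioned_pool(frame_embs, text_emb, temperature=1.0):
    """Pool per-frame embeddings into one vector, weighting each frame
    by its similarity to the text embedding.

    Assumption (not from the paper): weights are a softmax over cosine
    similarities. No learned parameters are involved.
    """

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        # Guard against zero vectors to avoid division by zero.
        norm = math.sqrt(dot(v, v)) or 1.0
        return [x / norm for x in v]

    text = normalize(text_emb)
    frames = [normalize(f) for f in frame_embs]

    # Cosine similarity of each frame to the text, scaled by temperature.
    scores = [dot(f, text) / temperature for f in frames]

    # Numerically stable softmax over the frame axis.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the normalized frame embeddings.
    dim = len(text)
    pooled = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    # Two toy "frames": the first is aligned with the text, the second is orthogonal.
    pooled, weights = text_conditioned_pool([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
    print(weights)
    print(pooled)
```

In this toy setup the frame aligned with the text receives the larger weight, so the pooled video embedding leans toward frames relevant to the query instead of averaging all frames equally. A lower `temperature` would sharpen that preference.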

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: BLIP: Bootstrapping Language-Image Pre-training