Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Yuchong Sun; Hongwei Xue; Ruihua Song; Bei Liu; Huan Yang; Jianlong Fu

arXiv:2210.06031 · cs.CV · March 3, 2023 · 33 cites

TL;DR

This paper introduces LF-VILA, a long-form video-language pre-training model that combines a multimodal temporal contrastive loss with hierarchical attention to model long videos efficiently, reporting state-of-the-art results on long-form paragraph-to-video retrieval and video question answering.

Contribution

The paper presents a pre-training framework for long-form videos and paragraphs. It pairs a multimodal temporal contrastive loss, which aligns video and language over time, with hierarchical attention, which models long-range dependencies at reduced computational cost.
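To make the idea of a contrastive alignment loss concrete, here is a minimal sketch of a generic symmetric InfoNCE-style loss between paired video and text embeddings. It is not the paper's multimodal temporal contrastive loss, whose exact form is not given on this page; the function names, the toy embeddings, and the temperature of 0.07 are all assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical helper, not the paper's loss).

    video_embs[i] and text_embs[i] are assumed to describe the same clip;
    every other pairing in the batch is treated as a negative.
    """
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    # Similarity matrix: rows index video clips, columns index text segments.
    logits = [[dot(vi, tj) / temperature for tj in t] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned:", symmetric_contrastive_loss(clips, aligned))
    print("shuffled:", symmetric_contrastive_loss(clips, shuffled))
```

In this sketch, correctly paired clips and sentences yield a lower loss than shuffled pairs, which is the property a contrastive objective relies on to pull matching video and text representations together.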

Findings

01 Achieves a 16.1% relative improvement on ActivityNet paragraph-to-video retrieval.

02 Achieves a 2.4% relative improvement on How2QA long-form video question answering.

03 Outperforms previous methods on multiple long-form video-language tasks.

Abstract

Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pre-training mainly focus on short-form videos (i.e., within 30 seconds) and sentences, leaving long-form video-language pre-training rarely explored. Directly learning representations from long-form videos and language may benefit many long-form video-language understanding tasks. However, it is challenging due to the difficulty of modeling long-range relationships and the heavy computational burden caused by more frames. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from an existing public dataset. To effectively capture the rich temporal dynamics and to better align video and language in an efficient end-to-end manner, we…

Code & Models

Repositories

microsoft/xpretrain · PyTorch · Official

Videos

Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Adam · Dense Connections · Softmax · Label Smoothing