Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu

TL;DR
This paper introduces LF-VILA, a long-form video-language pre-training model that uses multimodal temporal contrastive learning and hierarchical attention to improve understanding of long videos, achieving state-of-the-art results on paragraph-to-video retrieval and long-form video question answering.
Contribution
The paper presents a pre-training framework for long-form videos that combines a multimodal temporal contrastive loss with hierarchical attention, addressing the challenge of modeling long-range dependencies efficiently.
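To make the idea of contrastive video-language alignment concrete, the following is a minimal sketch of a generic symmetric InfoNCE loss between clip and sentence embeddings. It is not the paper's multimodal temporal contrastive loss, whose exact form is not given in this summary. The function names, the temperature of 0.07, and the toy embeddings are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to candidates[i].

    Every other candidate in the batch serves as a negative.
    The temperature value is an assumed default, not taken from the paper.
    """
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, c) / temperature for c in candidates]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def symmetric_contrastive_loss(clip_embs, sent_embs, temperature=0.07):
    """Average of the clip-to-sentence and sentence-to-clip InfoNCE terms."""
    return 0.5 * (
        info_nce(clip_embs, sent_embs, temperature)
        + info_nce(sent_embs, clip_embs, temperature)
    )


# Toy data: three clips from one video and three sentences from its paragraph.
clips = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_sentences = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled_sentences = [aligned_sentences[1], aligned_sentences[2], aligned_sentences[0]]

loss_aligned = symmetric_contrastive_loss(clips, aligned_sentences)
loss_shuffled = symmetric_contrastive_loss(clips, shuffled_sentences)
```

In this toy setup the loss is lower when each sentence is paired with its temporally matching clip than when the sentence order is shuffled, which is the kind of signal a temporal alignment objective relies on.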
Findings
Reports a 16.1% relative improvement on ActivityNet paragraph-to-video retrieval.
Reports a 2.4% improvement on How2QA long-form video question answering.
Outperforms previous methods on multiple long-form video-language tasks.
Abstract
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pre-training mainly focus on short-form videos (i.e., within 30 seconds) and sentences, leaving long-form video-language pre-training rarely explored. Directly learning representations from long-form videos and language may benefit many long-form video-language understanding tasks. However, it is challenging due to the difficulty of modeling long-range relationships and the heavy computational burden caused by more frames. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from an existing public dataset. To effectively capture the rich temporal dynamics and to better align video and language in an efficient end-to-end manner, we…
Taxonomy
Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Dropout · Adam · Dense Connections · Softmax · Label Smoothing
