VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer

TL;DR
VLM is a task-agnostic pre-training method for video and language understanding. It accepts video, text, or both as input, and uses new masking schemes to improve performance across multiple end tasks.
Contribution
It proposes a task-agnostic pre-training approach whose new masking schemes strengthen cross-modal mixing while keeping the model usable with video-only, text-only, or joint input.
Findings
Often outperforms task-specific pre-training methods
Shows strong performance across a wider range of video-language tasks than previous methods
Handles retrieval-style tasks as well as tasks that need joint video-text input
Abstract
We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, which limits their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, which limits early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training. Code is made available at…
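The two masking ideas named in the abstract can be illustrated with a small sketch. This is not the authors' released code: the function names `sample_masks` and `closest_video_index`, the probabilities 0.5 and 0.15, and the dot-product similarity are all hypothetical choices made only for illustration. The first function sometimes hides an entire modality, so that predictions must be made from the other modality alone; the second picks the video embedding closest to a predicted vector at a masked position.

```python
import random


def sample_masks(num_video, num_text, rng, p_whole_modality=0.5, p_token=0.15):
    """Return (video_mask, text_mask) as lists of bools; True means 'masked'.

    Both probabilities are illustrative assumptions, not values taken
    from the abstract.
    """
    if rng.random() < p_whole_modality:
        # Hide one modality entirely, so it must be predicted from the other.
        hide_video = rng.random() < 0.5
        return [hide_video] * num_video, [not hide_video] * num_text
    # Otherwise mask individual positions independently in both modalities.
    video_mask = [rng.random() < p_token for _ in range(num_video)]
    text_mask = [rng.random() < p_token for _ in range(num_text)]
    return video_mask, text_mask


def closest_video_index(predicted, video_embeddings):
    """Index of the candidate embedding with the highest dot product.

    Dot-product similarity is an assumption made for this sketch.
    """
    scores = [
        sum(p * v for p, v in zip(predicted, emb)) for emb in video_embeddings
    ]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    video_mask, text_mask = sample_masks(num_video=4, num_text=6, rng=rng)
    print(video_mask, text_mask)
    print(closest_video_index([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]))
```

In a real model the masks would select which positions feed a prediction loss, and the candidate embeddings would come from the video frames in the batch; both details are left out here.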
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
