VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Hu Xu; Gargi Ghosh; Po-Yao Huang; Prahal Arora; Masoumeh Aminzadeh; Christoph Feichtenhofer; Florian Metze; Luke Zettlemoyer

arXiv:2105.09996 · cs.CV · October 4, 2021 · 6 cites

TL;DR

VLM introduces a task-agnostic pre-training method for video and language understanding that accepts video, text, or both as input and uses new masking schemes to improve performance across multiple end tasks.

Contribution

The paper proposes a task-agnostic pre-training approach whose new masking schemes mix information across modalities while keeping the modalities separable, so that a single model can serve a variety of video-language end tasks.

Findings

01

Often outperforms task-specific pre-training methods

02

Shows strong performance across a wider range of video-language tasks than previous methods

03

Supports retrieval-style end tasks as well as tasks that use both modalities

Abstract

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, which limits their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, which limits early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training. Code is made available at…
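The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the probabilities 0.5 and 0.15, and the use of a dot product as the similarity measure are all assumptions made for illustration. The sketch covers only the cross-modal mixing part (masking a whole modality and scoring a prediction against video embeddings), not the separability part.

```python
import random


def sample_mask(num_video, num_text, rng, p_whole_modality=0.5, p_token=0.15):
    """Return (video_mask, text_mask) as lists of booleans; True means masked.

    Hypothetical scheme: with probability p_whole_modality one entire
    modality is masked, so it must be predicted from the other one.
    Otherwise each token in both modalities is masked independently
    with probability p_token.
    """
    if rng.random() < p_whole_modality:
        if rng.random() < 0.5:
            return [True] * num_video, [False] * num_text
        return [False] * num_video, [True] * num_text
    video_mask = [rng.random() < p_token for _ in range(num_video)]
    text_mask = [rng.random() < p_token for _ in range(num_text)]
    return video_mask, text_mask


def closest_video_index(query, video_embeddings):
    """Index of the video embedding with the highest dot product with query."""
    scores = [sum(q * v for q, v in zip(query, emb)) for emb in video_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    video_mask, text_mask = sample_mask(num_video=4, num_text=6, rng=rng)
    print(video_mask, text_mask)
    # A stand-in prediction for a masked text position, matched to toy video embeddings.
    print(closest_video_index([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1], [0.2, 0.2]]))
```

Under this reading, a fully masked text span gives the model no textual context, so its prediction at a masked position can only be driven by the video tokens; `closest_video_index` shows how such a prediction could be scored against candidate video embeddings.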

Code & Models

Repositories

pytorch/fairseq
pytorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition