VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Hao Tan; Jie Lei; Thomas Wolf; Mohit Bansal

arXiv:2106.11250 · cs.CV · June 22, 2021 · 33 cites

TL;DR

VIMPAC pre-trains video models by combining masked token prediction over discretized (VQ-VAE) video tokens with contrastive learning, so that the model captures both local spatio-temporal correlations and global video content.

Contribution

The paper proposes a block-wise masking strategy and an augmentation-free contrastive learning method for better video representation learning.
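
The contrastive objective mentioned above can be illustrated with a small sketch. It assumes an InfoNCE-style loss in which two clips from the same video form a positive pair and clips from other videos in the batch act as negatives; the function name `info_nce`, the temperature value, and the one-directional form are illustrative assumptions, not the paper's exact formulation.

```python
import math

def info_nce(feats_a, feats_b, temperature=0.1):
    """Hypothetical InfoNCE-style loss over clip features.

    feats_a[i] and feats_b[i] are feature vectors of two clips sampled
    from video i; no augmentation is applied, the two clips themselves
    are the two "views". Every feats_b[j] with j != i is a negative.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    loss = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of clip i against every clip in the other set.
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the same-video clip as the correct class.
        loss += log_denom - logits[i]
    return loss / len(a)
```

The loss is small when each clip is most similar to the other clip from its own video, and large when it is more similar to a clip from a different video.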

Findings

01 Achieves state-of-the-art results on SSV2 and Diving48 datasets.

02 Demonstrates the effectiveness of block-wise masking in capturing spatio-temporal correlations.

03 Provides detailed analysis on model scalability and pre-training strategies.

Abstract

Video understanding relies on perceiving the global content and modeling its internal connections (e.g., causality, movement, and spatio-temporal correspondence). To learn these interactions, we apply a mask-then-predict pre-training task on discretized video tokens generated via VQ-VAE. Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations. To deal with this issue, we propose a block-wise masking strategy where we mask neighboring video tokens in both spatial and temporal domains. We also add an augmentation-free contrastive learning method to further capture the global content by predicting whether the video clips are sampled from the same video. We…
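
The block-wise masking idea in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' sampling procedure: the function name `block_mask`, the maximum block size, and the rule of sampling blocks until a target ratio is reached are all hypothetical choices. The token grid is assumed to have shape T x H x W (time, height, width), as a VQ-VAE applied to video frames would produce.

```python
import random

def block_mask(T, H, W, mask_ratio=0.5, max_block=(2, 4, 4), seed=0):
    """Return a T x H x W nested list of booleans; True marks a masked token.

    Instead of masking tokens independently, contiguous spatio-temporal
    blocks are masked, so a masked token cannot be recovered trivially
    by copying an unmasked neighbor in space or time.
    Assumes 0 <= mask_ratio <= 1.
    """
    rng = random.Random(seed)
    mask = [[[False] * W for _ in range(H)] for _ in range(T)]
    target = int(mask_ratio * T * H * W)
    masked = 0
    while masked < target:
        # Sample a block size, then a position where the block fits.
        bt = rng.randint(1, min(max_block[0], T))
        bh = rng.randint(1, min(max_block[1], H))
        bw = rng.randint(1, min(max_block[2], W))
        t0 = rng.randint(0, T - bt)
        h0 = rng.randint(0, H - bh)
        w0 = rng.randint(0, W - bw)
        for t in range(t0, t0 + bt):
            for h in range(h0, h0 + bh):
                for w in range(w0, w0 + bw):
                    if not mask[t][h][w]:
                        mask[t][h][w] = True
                        masked += 1
    return mask
```

For example, `block_mask(4, 8, 8)` masks at least half of the 256 tokens in a 4 x 8 x 8 grid. Because the last sampled block may overshoot the target, the final ratio can be slightly above `mask_ratio`.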

Code & Models

Repositories

airsplay/vimpac (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning · VQ-VAE