VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning
Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal

TL;DR
VIMPAC pre-trains video models by combining masked token prediction, which models local spatio-temporal connections between video tokens, with contrastive learning, which captures the global content of a video.
Contribution
The paper proposes a block-wise masking strategy and an augmentation-free contrastive learning method for better video representation learning.
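The block-wise masking idea can be illustrated with a minimal sketch. It assumes the video has already been discretized into a grid of tokens with shape (frames, height, width). The function name `block_mask`, the block-size limits, and the default mask ratio are hypothetical choices for illustration, not values taken from the paper.

```python
import random


def block_mask(t, h, w, mask_ratio=0.5, seed=0):
    """Return a t x h x w nested list of booleans; True marks a masked token.

    Random cuboids (contiguous in time, height, and width) are sampled
    until at least `mask_ratio` of all tokens are masked. Block sizes and
    the ratio are illustrative assumptions, not the paper's settings.
    """
    rng = random.Random(seed)
    mask = [[[False] * w for _ in range(h)] for _ in range(t)]
    target = int(mask_ratio * t * h * w)
    masked = 0
    while masked < target:
        # Sample the block extent along each axis (at most half the axis).
        bt = rng.randint(1, max(1, t // 2))
        bh = rng.randint(1, max(1, h // 2))
        bw = rng.randint(1, max(1, w // 2))
        # Sample the block origin so the block stays inside the grid.
        t0 = rng.randint(0, t - bt)
        h0 = rng.randint(0, h - bh)
        w0 = rng.randint(0, w - bw)
        for i in range(t0, t0 + bt):
            for j in range(h0, h0 + bh):
                for k in range(w0, w0 + bw):
                    if not mask[i][j][k]:
                        mask[i][j][k] = True
                        masked += 1
    return mask


# Example: 5 frames of 16 x 16 tokens, roughly half of them masked.
mask = block_mask(5, 16, 16, mask_ratio=0.5)
num_masked = sum(cell for frame in mask for row in frame for cell in row)
```

Because each masked region spans neighboring positions in space and time, a masked token cannot be recovered simply by copying an adjacent visible token, which is the failure mode of uniform per-token masking described in the abstract.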
Findings
Achieves state-of-the-art results on the Something-Something-v2 (SSV2) and Diving48 datasets.
Demonstrates the effectiveness of block-wise masking in capturing spatio-temporal correlations.
Provides detailed analysis on model scalability and pre-training strategies.
Abstract
Video understanding relies on perceiving the global content and modeling its internal connections (e.g., causality, movement, and spatio-temporal correspondence). To learn these interactions, we apply a mask-then-predict pre-training task on discretized video tokens generated via VQ-VAE. Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations. To deal with this issue, we propose a block-wise masking strategy where we mask neighboring video tokens in both spatial and temporal domains. We also add an augmentation-free contrastive learning method to further capture the global content by predicting whether the video clips are sampled from the same video. We…
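The augmentation-free contrastive objective can be sketched as follows. The abstract only states that the model predicts whether clips are sampled from the same video. The InfoNCE-style loss, the function name `info_nce`, and the temperature value below are assumptions made for illustration and may differ from the paper's formulation. Each clip is represented by a plain feature vector, and `clips_a[i]` and `clips_b[i]` are assumed to be two clips drawn from the same video.

```python
import math


def info_nce(clips_a, clips_b, temperature=0.1):
    """Mean InfoNCE-style loss over a batch of clip feature vectors.

    clips_a[i] and clips_b[i] are features of two clips from the same
    video (the positive pair); every other pairing is a negative.
    No data augmentation is involved: positives differ only in where
    the clip was sampled within the video.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in clips_a]
    b = [normalize(v) for v in clips_b]
    loss = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of clip i against every candidate clip.
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the true same-video clip.
        loss += log_z - logits[i]
    return loss / len(a)


# Toy features: matching pairs should yield a much lower loss.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy example the loss is close to zero when the paired clips have matching features and large when the pairing is swapped, which is the behavior a same-video classifier should encourage.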
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning · VQ-VAE
