Masked Feature Prediction for Self-Supervised Visual Pre-Training
Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, Christoph Feichtenhofer

TL;DR
MaskFeat introduces a self-supervised pre-training method for video models that predicts masked features, notably using HOG descriptors, leading to state-of-the-art results on multiple video benchmarks without extra supervision.
Contribution
The paper proposes Masked Feature Prediction (MaskFeat), a self-supervised learning approach that masks part of the input sequence and predicts features of the masked regions. HOG features work particularly well as the prediction target, and the approach scales to large Transformer-based models.
Findings
Achieved 86.7% on Kinetics-400 with MViT-L
Attained 88.3% on Kinetics-600
Obtained 39.8 mAP on AVA
Abstract
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 39.8 mAP on AVA, and…
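The mask-then-predict objective described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration on a single grayscale frame, not the authors' implementation. The function names, the 16-pixel patch size, the 9 orientation bins, the 0.4 mask ratio, and the squared-error loss are all assumptions made for this example. The per-patch L2 normalization is only a simplified stand-in for the local contrast normalization in HOG, and the all-zero prediction stands in for the output of a real model.

```python
import numpy as np

def hog_like_targets(frame, patch=16, bins=9, eps=1e-6):
    """Simplified HOG-style target: one orientation histogram per patch."""
    gy, gx = np.gradient(frame.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    # Unsigned gradient orientation, folded into [0, pi).
    angle = np.mod(np.arctan2(gy, gx), np.pi)
    bin_idx = np.minimum((angle / np.pi * bins).astype(int), bins - 1)

    rows, cols = frame.shape[0] // patch, frame.shape[1] // patch
    feats = np.zeros((rows * cols, bins))
    for i in range(rows):
        for j in range(cols):
            window = (slice(i * patch, (i + 1) * patch),
                      slice(j * patch, (j + 1) * patch))
            # Magnitude-weighted vote of each pixel into its orientation bin.
            feats[i * cols + j] = np.bincount(
                bin_idx[window].ravel(),
                weights=magnitude[window].ravel(),
                minlength=bins,
            )
    # Assumed stand-in for local contrast normalization: L2-normalize each patch.
    feats /= np.linalg.norm(feats, axis=1, keepdims=True) + eps
    return feats

def random_mask(num_patches, ratio, rng):
    """Boolean mask with round(ratio * num_patches) randomly chosen patches set."""
    mask = np.zeros(num_patches, dtype=bool)
    num_masked = int(round(ratio * num_patches))
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_l2_loss(pred, target, mask):
    """Mean squared error computed over masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
frame = rng.random((64, 64))               # synthetic frame, illustration only
targets = hog_like_targets(frame)          # shape: (16 patches, 9 bins)
mask = random_mask(len(targets), 0.4, rng)
pred = np.zeros_like(targets)              # placeholder for a model's predictions
loss = masked_l2_loss(pred, targets, mask)
```

In a real pipeline the model would see the input with the masked patches hidden and would be trained to minimize this loss; visible patches contribute nothing to the objective.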
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Image Enhancement Techniques · Human Pose and Action Recognition
Methods: Local Contrast Normalization
