Masked Feature Prediction for Self-Supervised Visual Pre-Training

Chen Wei; Haoqi Fan; Saining Xie; Chao-Yuan Wu; Alan Yuille; Christoph Feichtenhofer

arXiv:2112.09133·cs.CV·January 13, 2023·25 cites

TL;DR

MaskFeat introduces a self-supervised pre-training method for video models that predicts masked features, notably using HOG descriptors, leading to state-of-the-art results on multiple video benchmarks without extra supervision.

Contribution

The paper proposes Masked Feature Prediction (MaskFeat), a novel self-supervised learning approach that predicts features of masked regions, demonstrating effectiveness with HOG features and large-scale Transformer models.
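As a rough illustration of predicting features only for masked regions, the sketch below draws a random patch mask and averages a squared error over the masked patches alone. The patch count, feature dimension, mask ratio, the L2 loss, and the helper names `random_patch_mask` and `masked_l2_loss` are all hypothetical choices for illustration; this page does not give the paper's settings.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask with round(mask_ratio * num_patches) True entries."""
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_l2_loss(pred, target, mask):
    """Mean squared error computed over masked patches only."""
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
num_patches, feat_dim = 196, 108                    # assumed sizes, not from the paper page
target = rng.normal(size=(num_patches, feat_dim))   # stand-in for per-patch target features
pred = np.zeros_like(target)                        # stand-in for model predictions
mask = random_patch_mask(num_patches, 0.4, rng)     # assumed mask ratio
loss = masked_l2_loss(pred, target, mask)
```

Unmasked patches contribute nothing to the loss here, so the model is only scored on regions it could not see.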

Findings

01 Achieved 86.7% on Kinetics-400 with MViT-L

02 Attained 88.3% on Kinetics-600

03 Obtained 39.8 mAP on AVA

Abstract

We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 39.8 mAP on AVA, and…
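To make the role of normalization in a HOG-style target more concrete, here is a minimal sketch of a simplified HOG descriptor for a single grayscale frame. It is not the authors' implementation: the 8x8 cell size, the 9 unsigned orientation bins, the per-cell L2 normalization (used here as a simple stand-in for local contrast normalization), and the function name `hog_cells` are all assumptions.

```python
import numpy as np

def hog_cells(image, cell=8, bins=9, eps=1e-6):
    """Simplified HOG: one L2-normalized orientation histogram per cell."""
    gy, gx = np.gradient(image.astype(np.float64))   # derivatives along rows, columns
    mag = np.hypot(gx, gy)                            # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)

    h, w = image.shape
    ch, cw = h // cell, w // cell                     # number of cells per axis
    hist = np.zeros((ch, cw, bins))
    for b in range(bins):
        # Keep magnitudes that fall in bin b, then sum them within each cell.
        weighted = np.where(bin_idx == b, mag, 0.0)[:ch * cell, :cw * cell]
        hist[:, :, b] = weighted.reshape(ch, cell, cw, cell).sum(axis=(1, 3))

    # Per-cell L2 normalization (assumed form of the contrast normalization).
    norm = np.sqrt((hist ** 2).sum(axis=-1, keepdims=True)) + eps
    return hist / norm

rng = np.random.default_rng(0)
image = rng.random((32, 32))      # toy grayscale frame
feats = hog_cells(image)          # shape: (4, 4, 9)
```

Because each histogram is divided by its own norm, rescaling the image contrast leaves the features essentially unchanged, which is one way to see why a normalized target can be less sensitive to brightness and contrast than raw gradients.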

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Image Enhancement Techniques · Human Pose and Action Recognition

Methods: Local Contrast Normalization