It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training

Yuxin Song; Min Yang; Wenhao Wu; Dongliang He; Fu Li; Jingdong Wang

arXiv:2210.05234 · cs.CV · October 12, 2022 · 5 cites

Open Access

TL;DR

This paper introduces MAM2, a self-supervised video pre-training framework that explicitly incorporates motion cues through a novel mask-and-predict approach, significantly improving training efficiency and representation quality.

Contribution

The work proposes a new Masked Appearance-Motion Modeling framework with a specialized encoder-regressor-decoder architecture for enhanced video representation learning.

Findings

01

Speeds up convergence, requiring half the epochs of previous methods.

02

Achieves state-of-the-art performance on multiple video benchmarks.

03

Effectively utilizes RGB-difference for motion prediction.
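
As a rough illustration of what an RGB-difference motion target can look like, the sketch below subtracts consecutive frames and scores a prediction only at masked positions. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `rgb_difference` and `masked_mse`, the clip layout `(T, H, W, C)`, the per-pixel mask, and the plain mean-squared-error loss are all chosen here for illustration, and the summary above does not specify the paper's patching, normalization, or loss.

```python
import numpy as np


def rgb_difference(video):
    """Frame-to-frame RGB difference for a clip shaped (T, H, W, C).

    Returns an array shaped (T - 1, H, W, C); entry t is frame t+1 minus frame t.
    """
    video = np.asarray(video, dtype=np.float32)
    return video[1:] - video[:-1]


def masked_mse(prediction, target, mask):
    """Mean squared error averaged over masked positions only.

    mask is boolean and broadcastable to target; True marks a masked position.
    At least one position must be masked, otherwise the mean is undefined.
    """
    mask = np.broadcast_to(mask, target.shape)
    squared_error = (prediction - target) ** 2
    return float(squared_error[mask].mean())


# Toy clip: 4 frames of 2x2 RGB whose brightness grows by 1 per frame.
clip = np.stack([np.full((2, 2, 3), t, dtype=np.float32) for t in range(4)])
motion_target = rgb_difference(clip)  # shape (3, 2, 2, 3)

# Hypothetical mask: hide the left column of every difference frame.
mask = np.zeros((3, 2, 2, 1), dtype=bool)
mask[:, :, 0, :] = True

# An all-zero prediction stands in for the output of a motion decoder.
loss = masked_mse(np.zeros_like(motion_target), motion_target, mask)
```

In a mask-and-predict setup, a target of this kind would be computed from the original clip, while the model sees only the unmasked portion of the input.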

Abstract

Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline. Methods built on this pipeline have demonstrated outstanding effectiveness on downstream video tasks and superior data efficiency on small datasets. However, temporal relations are not fully exploited by these methods. In this work, we explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling (MAM2) framework. Specifically, we design an encoder-regressor-decoder pipeline for this task. The regressor separates feature encoding from pretext task completion, so that the feature extraction process is completed adequately by the encoder. To guide the encoder to fully excavate spatial-temporal features, two separate decoders are used for the two pretext tasks of disentangled appearance and motion prediction. We explore various motion prediction…

Taxonomy

Topics: Face recognition and analysis · Human Pose and Action Recognition · Video Surveillance and Tracking Methods

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings