Motion-Guided Masking for Spatiotemporal Representation Learning

David Fan; Jue Wang; Shuai Liao; Yi Zhu; Vimal Bhat; Hector Santos-Villalobos; Rohith MV; Xinyu Li

arXiv:2308.12962·cs.CV·August 25, 2023

TL;DR

This paper introduces a motion-guided masking algorithm for video masked autoencoders that leverages motion vectors from compressed videos, improving efficiency and performance in video understanding tasks.

Contribution

The motion-guided masking (MGM) algorithm uses motion vectors to position the masks in a video MAE, improving efficiency and accuracy over random masking strategies.

Findings

01. Improves accuracy by up to +1.3% on Kinetics-400 and Something-Something V2.

02. Reduces training epochs by up to 66% while maintaining performance.

03. Improves downstream transfer learning and domain adaptation results.

Abstract

Several recent works have directly extended the image masked autoencoder (MAE) with random masking into the video domain, achieving promising results. However, unlike images, both spatial and temporal information are important for video understanding. This suggests that the random masking strategy that is inherited from the image MAE is less effective for video MAE. This motivates the design of a novel masking algorithm that can more efficiently make use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be directly obtained from information stored in the compressed format of the video, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip…
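To make the idea of a mask whose position follows motion over time more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a single rectangular block on the patch grid, a hypothetical list of per-frame displacements (in patch units) that stands in for motion vectors read from the compressed video, and that patches inside the block are the ones hidden. The function name `moving_block_masks` and all of its parameters are illustrative.

```python
from typing import List, Tuple

Mask = List[List[bool]]


def moving_block_masks(
    displacements: List[Tuple[int, int]],
    grid: Tuple[int, int],
    block: Tuple[int, int],
    start: Tuple[int, int],
) -> List[Mask]:
    """Return one boolean patch mask per frame; True marks a masked patch.

    displacements: hypothetical (dy, dx) shifts between consecutive frames,
        in patch units (an assumed stand-in for codec motion vectors).
    grid: (height, width) of the patch grid.
    block: (height, width) of the rectangular masked block.
    start: (top, left) of the block in the first frame.
    """
    grid_h, grid_w = grid
    block_h, block_w = block
    top, left = start
    masks: List[Mask] = []
    # The first frame uses the start position; each displacement then
    # moves the block to its position in the next frame.
    for dy, dx in [(0, 0)] + list(displacements):
        # Clamp so the block always stays fully inside the grid.
        top = min(max(top + dy, 0), grid_h - block_h)
        left = min(max(left + dx, 0), grid_w - block_w)
        masks.append([
            [top <= r < top + block_h and left <= c < left + block_w
             for c in range(grid_w)]
            for r in range(grid_h)
        ])
    return masks


if __name__ == "__main__":
    demo = moving_block_masks(
        [(0, 1), (1, 0), (0, 5)], grid=(4, 6), block=(2, 2), start=(0, 0)
    )
    for frame in demo:
        print("\n".join("".join("#" if m else "." for m in row) for row in frame))
        print()
```

Because the block keeps its size and is clamped to the grid, the number of masked patches is the same in every frame of this sketch, so the masking ratio stays fixed while the masked region moves with the content.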

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Masked autoencoder