Concatenated Masked Autoencoders as Spatial-Temporal Learner
Zhouqiang Jiang, Bowen Wang, Tong Xiang, Zhaofeng Niu, Hong Tang, Guangshun Li, Liangzhi Li

TL;DR
CatMAE is a self-supervised video representation learning method that uses masked autoencoders with a novel spatial-temporal masking strategy and data augmentation, leading to improved performance in video segmentation and action recognition.
Contribution
The paper introduces CatMAE, a novel masked autoencoder framework that effectively captures spatial-temporal information in videos using extensive masking and a new data augmentation strategy.
Findings
Achieves state-of-the-art results in video segmentation.
Outperforms existing methods in action recognition.
Utilizes a new Video-Reverse augmentation to enhance learning.
Abstract
Learning representations from videos requires understanding continuous motion and visual correspondences between frames. In this paper, we introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning. For the input sequence of video frames, CatMAE keeps the initial frame unchanged while applying substantial masking (95%) to subsequent frames. The encoder in CatMAE is responsible for encoding visible patches for each frame individually; subsequently, for each masked frame, the decoder leverages visible patches from both previous and current frames to reconstruct the original image. Our proposed method enables the model to estimate the motion information between visible patches, match the correspondences between preceding and succeeding frames, and ultimately learn the evolution of scenes. Furthermore, we propose a…
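The masking scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`catmae_style_masks`, `decoder_context`), the uniform random choice of visible patches, the 196-patch grid, and the decision to give the decoder every frame from the first up to the current one are all assumptions made for illustration. Only the fully visible first frame and the 95% masking ratio on later frames come from the abstract.

```python
import numpy as np


def catmae_style_masks(num_frames, num_patches, mask_ratio=0.95, seed=0):
    """Return a boolean array of shape (num_frames, num_patches).

    True marks a visible patch. The first frame is left fully visible;
    every later frame keeps roughly (1 - mask_ratio) of its patches.
    Uniform random sampling is an assumption, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    visible = np.zeros((num_frames, num_patches), dtype=bool)
    visible[0] = True  # initial frame is kept unchanged
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))
    for t in range(1, num_frames):
        keep_idx = rng.choice(num_patches, size=num_keep, replace=False)
        visible[t, keep_idx] = True
    return visible


def decoder_context(visible, t):
    """List (frame, patch) pairs of visible patches in frames 0..t.

    Hypothetical helper: it models the decoder seeing visible patches
    from the previous and current frames when reconstructing frame t.
    """
    frames, patches = np.nonzero(visible[: t + 1])
    return list(zip(frames.tolist(), patches.tolist()))


# Example: 3 frames, each split into 196 patches (e.g. a 14 x 14 grid).
masks = catmae_style_masks(num_frames=3, num_patches=196)
visible_per_frame = masks.sum(axis=1).tolist()
context_for_last = decoder_context(masks, t=2)
```

With these settings each masked frame exposes only about ten patches, so nearly all of the information available for reconstructing it has to come from the earlier frames, which is the pressure toward learning correspondences that the abstract describes.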
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis
