Concatenated Masked Autoencoders as Spatial-Temporal Learner

Zhouqiang Jiang; Bowen Wang; Tong Xiang; Zhaofeng Niu; Hong Tang; Guangshun Li; Liangzhi Li

arXiv:2311.00961·cs.CV·December 15, 2023·1 cites

TL;DR

CatMAE is a self-supervised video representation learning method that uses masked autoencoders with a novel spatial-temporal masking strategy and data augmentation, leading to improved performance in video segmentation and action recognition.

Contribution

The paper introduces CatMAE, a novel masked autoencoder framework that effectively captures spatial-temporal information in videos using extensive masking and a new data augmentation strategy.

Findings

01. Achieves state-of-the-art results in video segmentation.

02. Outperforms existing methods in action recognition.

03. Utilizes a new Video-Reverse augmentation to enhance learning.

Abstract

Learning representations from videos requires understanding continuous motion and visual correspondences between frames. In this paper, we introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning. For the input sequence of video frames, CatMAE keeps the initial frame unchanged while applying substantial masking (95%) to subsequent frames. The encoder in CatMAE is responsible for encoding visible patches for each frame individually; subsequently, for each masked frame, the decoder leverages visible patches from both previous and current frames to reconstruct the original image. Our proposed method enables the model to estimate the motion information between visible patches, match the correspondences between preceding and succeeding frames, and ultimately learn the evolution of scenes. Furthermore, we propose a…
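The masking scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `catmae_style_masks`, the patch count of 196, uniform random patch selection, and the reading of "previous frames" as all preceding frames are assumptions made for the example.

```python
import numpy as np

def catmae_style_masks(num_frames, num_patches, mask_ratio=0.95, seed=0):
    """Hypothetical helper: boolean mask of shape (num_frames, num_patches).

    True marks a masked (hidden) patch. Frame 0 is left fully visible;
    every later frame has `mask_ratio` of its patches masked.
    Uniform random selection is an assumption, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masks = np.zeros((num_frames, num_patches), dtype=bool)
    for t in range(1, num_frames):
        # Pick distinct patch indices to hide in frame t.
        masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
        masks[t, masked_idx] = True
    return masks

# Assumed setup: 3 frames, 14 x 14 = 196 patches per frame.
masks = catmae_style_masks(num_frames=3, num_patches=196)

# Patches the encoder would see per frame: [196, 10, 10]
visible_per_frame = (~masks).sum(axis=1)

# Visible patches available to the decoder when reconstructing frame t,
# assuming it attends to ALL preceding frames plus the current one.
decoder_context_sizes = np.cumsum(visible_per_frame)

print(visible_per_frame)
print(decoder_context_sizes)
```

With a 95% ratio, each masked frame exposes only about 10 of 196 patches, so most of the information needed for reconstruction has to come from earlier frames. That is consistent with the abstract's claim that the model must match correspondences between frames.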

Code & Models

Repositories

minhoooo1/catmae (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis