Recurrent Video Masked Autoencoders

Daniel Zoran; Nikhil Parthasarathy; Yi Yang; Drew A Hudson; Joao Carreira; Andrew Zisserman

arXiv:2512.13684 · cs.CV · April 22, 2026

TL;DR

Recurrent Video Masked Autoencoders (RVM) introduce a recurrent transformer-based model for efficient, long-term video representation learning, achieving competitive performance with state-of-the-art models on various video tasks.

Contribution

RVM is a novel recurrent transformer-based video encoder that improves efficiency and long-term feature propagation without requiring knowledge distillation.

Findings

01

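RVM is competitive with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on action classification and on point and object tracking, and matches or exceeds image models (e.g. DINOv2) on tasks that require strong geometric and dense spatial features.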

02

RVM achieves up to 30x greater parameter efficiency than competing models.

03

RVM effectively propagates features over long temporal horizons with linear cost.

Abstract

We present Recurrent Video Masked-Autoencoders (RVM): a novel approach to video representation learning that leverages recurrent computation to model the temporal structure of video data. RVM couples an asymmetric masking objective with a transformer-based recurrent neural network to aggregate information over time, training solely on a simple pixel reconstruction loss. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action classification, and point and object tracking, while matching or exceeding the performance of image models (e.g. DINOv2) on tasks that require strong geometric and dense spatial features. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter…
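The training recipe the abstract describes (a recurrent state aggregated over frames, a masking objective, and a pixel reconstruction loss) can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the names `patchify`, `recurrent_step`, `rollout_loss` and `masked_pixel_loss` are hypothetical, a single `tanh` layer stands in for the transformer-based recurrent core, and "asymmetric" is read as "context frames fully visible, only the final frame masked", which is one possible interpretation.

```python
import numpy as np

def patchify(frame, p):
    """Split an (H, W, C) frame into (N, p*p*C) flattened patches."""
    H, W, C = frame.shape
    x = frame.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * C)

def masked_pixel_loss(pred, target, mask):
    """Mean squared pixel error over masked patches only (mask: bool, shape (N,))."""
    diff = (pred - target) ** 2
    return diff[mask].mean()

def recurrent_step(state, tokens, W_s, W_x):
    """Toy stand-in for a recurrent transformer core (assumption, not RVM's block)."""
    return np.tanh(state @ W_s + tokens @ W_x)

def rollout_loss(video, p, mask_ratio, params, rng):
    """Run the toy recurrence over a (T, H, W, C) clip and score the last frame."""
    W_in, W_s, W_x, W_out = params
    n = (video.shape[1] // p) * (video.shape[2] // p)
    state = np.zeros((n, W_s.shape[0]))

    # Context frames are fed unmasked (assumed reading of "asymmetric").
    # One state update per frame, so cost grows linearly with clip length.
    for frame in video[:-1]:
        state = recurrent_step(state, patchify(frame, p) @ W_in, W_s, W_x)

    # The final frame is masked at the patch level; masked patches are zeroed.
    target = patchify(video[-1], p)
    mask = rng.random(n) < mask_ratio
    mask[0] = True  # guarantee at least one masked patch so the loss is defined
    visible = np.where(mask[:, None], 0.0, target)

    state = recurrent_step(state, visible @ W_in, W_s, W_x)
    pred = state @ W_out  # linear pixel decoder (assumption)
    return masked_pixel_loss(pred, target, mask)
```

Calling `rollout_loss` on a random clip of shape `(T, H, W, C)` with `H` and `W` divisible by the patch size `p` returns a scalar loss; a real model would replace `recurrent_step` with learned transformer blocks and minimize this loss by gradient descent.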

Code & Models

Models

tue-mps/towards-video-image-frozen (Hugging Face model)
