TL;DR
Recurrent Video Masked Autoencoders (RVM) introduce a recurrent transformer-based model for efficient, long-term video representation learning. RVM is competitive with state-of-the-art video models on action classification and on point and object tracking.
Contribution
RVM is a novel recurrent transformer-based video encoder that improves efficiency and long-term feature propagation without requiring knowledge distillation.
Findings
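RVM is competitive with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on action classification and on point and object tracking, and matches or exceeds image models (e.g. DINOv2) on tasks that require strong geometric and dense spatial features.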
RVM achieves up to 30x greater parameter efficiency than competing models.
RVM propagates features over long temporal horizons at a computational cost that grows linearly with video length.
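To make the linear-cost finding concrete, the following is a back-of-the-envelope sketch, not a measurement from the paper. It counts token-pair interactions under two simplified assumptions: joint spatio-temporal attention lets every token in the clip attend to every other token, while a recurrent model processes one frame at a time against a fixed-size state. The function names, the `state_tokens` parameter, and the 196-tokens-per-frame figure are hypothetical choices made for illustration.

```python
def joint_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Token-pair interactions when all frames attend to each other at once."""
    total_tokens = num_frames * tokens_per_frame
    return total_tokens * total_tokens


def recurrent_attention_pairs(
    num_frames: int, tokens_per_frame: int, state_tokens: int
) -> int:
    """Token-pair interactions when each frame is processed once
    against a fixed-size recurrent state (an assumed simplification)."""
    per_step = (tokens_per_frame + state_tokens) ** 2
    return num_frames * per_step


if __name__ == "__main__":
    # Hypothetical setting: 196 tokens per frame, 196 state tokens.
    for frames in (16, 32, 64):
        joint = joint_attention_pairs(frames, 196)
        recurrent = recurrent_attention_pairs(frames, 196, 196)
        print(f"frames={frames} joint={joint} recurrent={recurrent}")
```

Under these assumptions, doubling the number of frames doubles the recurrent count but quadruples the joint-attention count, which is the gap that widens over long temporal horizons.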
Abstract
We present Recurrent Video Masked-Autoencoders (RVM): a novel approach to video representation learning that leverages recurrent computation to model the temporal structure of video data. RVM couples an asymmetric masking objective with a transformer-based recurrent neural network to aggregate information over time, training solely on a simple pixel reconstruction loss. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action classification, and point and object tracking, while matching or exceeding the performance of image models (e.g. DINOv2) on tasks that require strong geometric and dense spatial features. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter…
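The training recipe described in the abstract (recurrent aggregation over time, asymmetric masking, and a pixel reconstruction loss) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: a plain tanh recurrence stands in for the transformer-based recurrent network, "asymmetric masking" is assumed here to mean that earlier frames are seen in full while the final frame is heavily masked, and every name, shape, and the 0.75 mask ratio is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: frames, patches per frame, pixels per patch, feature width.
T, N, P, D = 4, 16, 12, 8
MASK_RATIO = 0.75  # assumed value, not taken from the paper

video = rng.normal(size=(T, N, P))

# Random linear maps standing in for the encoder, recurrent core, and decoder.
W_enc = rng.normal(scale=0.1, size=(P, D))
W_state = rng.normal(scale=0.1, size=(D, D))
W_in = rng.normal(scale=0.1, size=(D, D))
W_dec = rng.normal(scale=0.1, size=(D, P))


def recurrent_step(state: np.ndarray, frame: np.ndarray) -> np.ndarray:
    """Fold one frame's patch features into the running state."""
    feats = frame @ W_enc  # (N, D) per-patch features
    return np.tanh(state @ W_state + feats @ W_in)


def masked_reconstruction_loss(clip: np.ndarray) -> float:
    """Mean squared pixel error on the masked patches of the last frame."""
    state = np.zeros((N, D))
    # Earlier frames are seen in full and aggregated over time.
    for frame in clip[:-1]:
        state = recurrent_step(state, frame)
    # Pick the patches of the final frame that are hidden and must be predicted.
    target = clip[-1]
    num_masked = int(MASK_RATIO * N)
    masked_idx = rng.permutation(N)[:num_masked]
    prediction = state @ W_dec  # (N, P) predicted pixels per patch
    err = prediction[masked_idx] - target[masked_idx]
    return float(np.mean(err ** 2))


loss = masked_reconstruction_loss(video)
```

The point of the sketch is the data flow: the state is the only channel carrying information from past frames, and the loss is computed purely in pixel space on hidden patches, with no teacher network or distillation target.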
