ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning

Sucheng Ren; Hongru Zhu; Chen Wei; Yijiang Li; Alan Yuille; Cihang Xie

arXiv:2405.15160 · cs.CV · May 27, 2024

Open Access

TL;DR

ARVideo is a self-supervised video pretraining framework that autoregressively predicts video tokens. It groups tokens into spatiotemporal clusters and predicts them in a randomized order, which improves both training efficiency and downstream accuracy.

Contribution

The paper proposes an autoregressive video pretraining method that groups tokens into clusters spanning both space and time and predicts those clusters in a randomized order, so that each prediction aggregates richer context and training is more efficient.

Findings

01  Achieves 81.2% on Kinetics-400 with a ViT-B backbone.

02  Attains 70.9% on Something-Something V2.

03  Trains 14% faster and uses 58% less GPU memory than VideoMAE.

Abstract

This paper presents a new self-supervised video representation learning framework, ARVideo, which autoregressively predicts the next video token in a tailored sequence order. Two key designs are included. First, we organize autoregressive video tokens into clusters that span both spatially and temporally, thereby enabling a richer aggregation of contextual information compared to the standard spatial-only or temporal-only clusters. Second, we adopt a randomized spatiotemporal prediction order to facilitate learning from multi-dimensional data, addressing the limitations of a handcrafted spatial-first or temporal-first sequence order. Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning. For example, when trained with the ViT-B backbone, ARVideo competitively attains 81.2% on Kinetics-400 and 70.9% on Something-Something V2,…
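The two designs described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 8 x 14 x 14 token grid, and the 2 x 2 x 2 cluster size are assumptions chosen only for illustration. The sketch groups token indices into clusters that span time, height, and width, then shuffles the cluster order to obtain a randomized prediction sequence.

```python
import numpy as np

def spatiotemporal_clusters(t, h, w, kt, kh, kw):
    """Group a (t, h, w) grid of token indices into clusters of kt*kh*kw tokens.

    Hypothetical helper: each returned row holds the flat indices of the
    tokens in one cluster, which spans kt frames and a kh x kw spatial window.
    """
    assert t % kt == 0 and h % kh == 0 and w % kw == 0, "grid must divide evenly"
    ids = np.arange(t * h * w).reshape(t, h, w)
    # Split every axis into (number of blocks, block size).
    blocks = ids.reshape(t // kt, kt, h // kh, kh, w // kw, kw)
    # Move the three block axes to the front, then flatten each block.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(-1, kt * kh * kw)

def randomized_order(clusters, seed=None):
    """Return the clusters in a random order (one possible prediction sequence)."""
    rng = np.random.default_rng(seed)
    return clusters[rng.permutation(len(clusters))]

# Assumed sizes for illustration only: 8 x 14 x 14 tokens, 2 x 2 x 2 clusters.
clusters = spatiotemporal_clusters(t=8, h=14, w=14, kt=2, kh=2, kw=2)
sequence = randomized_order(clusters, seed=0)
# In an autoregressive setup, cluster sequence[i] would be predicted
# from the clusters sequence[:i] that precede it.
```

With these assumed sizes the 1568 tokens form 196 clusters of 8 tokens each. A spatial-only or temporal-only clustering would correspond to setting `kt=1` or `kh=kw=1` respectively, which is the baseline the abstract contrasts against.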

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Sparse Evolutionary Training