ViViT: A Video Vision Transformer

Anurag Arnab; Mostafa Dehghani; Georg Heigold; Chen Sun; Mario Lučić; Cordelia Schmid

arXiv:2103.15691·cs.CV·November 2, 2021

TL;DR

ViViT introduces pure-transformer models for video classification that efficiently handle long sequences, leverage pretraining, and achieve state-of-the-art results on multiple benchmarks.

Contribution

The paper proposes efficient transformer variants for video classification that factorise the spatial and temporal dimensions of the input, and shows how regularisation and pretrained image models make training on comparatively small datasets feasible.

Findings

01. Achieved state-of-the-art results on Kinetics 400 and 600.

02. Developed efficient transformer variants for spatio-temporal modeling.

03. Demonstrated effective training with limited data using regularization and pretraining.

Abstract

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens,…
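To make the sequence-length argument concrete, the following is a minimal sketch of how token counts and attention cost might scale when a clip is cut into non-overlapping spatio-temporal patches. The clip size, patch size, function names, and the particular factorisation (spatial attention within each temporal index, followed by temporal attention at each spatial location) are illustrative assumptions, not details taken from the abstract.

```python
def token_grid(frames, height, width, t, h, w):
    """Count non-overlapping t x h x w patches along each axis (assumed tokenisation)."""
    if frames % t or height % h or width % w:
        raise ValueError("patch size must divide the clip size")
    return frames // t, height // h, width // w


def attention_pairs(frames, height, width, t, h, w):
    """Compare query-key pair counts for joint and factorised attention."""
    nt, nh, nw = token_grid(frames, height, width, t, h, w)
    n_spatial = nh * nw
    n_tokens = nt * n_spatial
    # Joint attention: every token attends to every token in the clip.
    joint = n_tokens ** 2
    # Hypothetical factorisation: spatial attention inside each temporal index,
    # then temporal attention across indices at each spatial location.
    factorised = nt * n_spatial ** 2 + n_spatial * nt ** 2
    return {"tokens": n_tokens, "joint": joint, "factorised": factorised}


# Assumed example: a 32-frame 224x224 clip with 2x16x16 patches.
stats = attention_pairs(32, 224, 224, 2, 16, 16)
print(stats)
```

Under these assumed sizes the clip yields 3136 tokens, and the factorised count is more than an order of magnitude below the joint count, which is the kind of saving that motivates factorising the spatial and temporal dimensions.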
