TL;DR
ViViT introduces pure-transformer models for video classification that handle long token sequences by factorizing the spatial and temporal dimensions of the input, leverage pretrained image models, and achieve state-of-the-art results on multiple video classification benchmarks.
Contribution
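The paper proposes several efficient transformer variants for video classification that factorize the spatial and temporal dimensions of the input. It also shows how regularization during training and pretrained image models make it possible to train on comparatively small datasets.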
Findings
Achieved state-of-the-art results on Kinetics 400 and 600.
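Developed efficient transformer variants that factorize the spatial and temporal dimensions to cope with the long token sequences found in video.
Demonstrated effective training on comparatively small datasets using regularization and pretrained image models.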
Abstract
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens,…
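To make the motivation for factorization concrete, the sketch below counts spatio-temporal tokens for a clip and compares the number of query-key pairs under joint attention with one possible factorized scheme. It is an illustration under stated assumptions, not the paper's implementation: the clip size (32 frames at 224x224), the tubelet size (2x16x16), and all function names are hypothetical, and the factorization shown (spatial attention within each temporal slice, then temporal attention at each spatial position) is only one way to split the two dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenGrid:
    """Number of tokens along the temporal, height and width axes."""
    n_t: int
    n_h: int
    n_w: int

    @property
    def n_spatial(self) -> int:
        # Tokens in a single temporal slice.
        return self.n_h * self.n_w

    @property
    def n_tokens(self) -> int:
        # Total spatio-temporal tokens for the clip.
        return self.n_t * self.n_spatial


def tubelet_grid(frames, height, width, tubelet=(2, 16, 16)):
    """Token grid for non-overlapping tubelets (tubelet size is an assumption)."""
    t, h, w = tubelet
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return TokenGrid(frames // t, height // h, width // w)


def joint_attention_pairs(grid):
    """Query-key pairs when every token attends to every other token."""
    return grid.n_tokens ** 2


def factorised_attention_pairs(grid):
    """Query-key pairs for spatial-then-temporal attention (assumed scheme)."""
    spatial = grid.n_t * grid.n_spatial ** 2    # within each temporal slice
    temporal = grid.n_spatial * grid.n_t ** 2   # across time, per spatial position
    return spatial + temporal


grid = tubelet_grid(frames=32, height=224, width=224)
joint = joint_attention_pairs(grid)
factorised = factorised_attention_pairs(grid)
print(f"tokens: {grid.n_tokens}")
print(f"joint pairs: {joint}")
print(f"factorised pairs: {factorised}")
print(f"reduction: {joint / factorised:.1f}x")
```

Pair counts are only a rough proxy for cost: they ignore attention heads, MLP layers, and constant factors. They do show why sequence length dominates, since joint attention grows quadratically in the total token count.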
Peer Reviews
No public reviews on file for this paper yet.
Code & Models
- 🤗 keras-io/video-vision-transformer · model · 20 dl · ♡ 9
- 🤗 pablorodriper/video-vision-transformer · model · ♡ 1
- 🤗 google/vivit-b-16x2 · model · 15k dl · ♡ 11
- 🤗 google/vivit-b-16x2-kinetics400 · model · 26k dl · ♡ 38
- 🤗 Neleac/SpaceTimeGPT · model · 86 dl · ♡ 32
- 🤗 google/videoprism-base-f16r288 · model · 17k dl · ♡ 98
- 🤗 google/videoprism-large-f8r288 · model · 813 dl · ♡ 18
- 🤗 google/videoprism-lvt-base-f16r288 · model · 17k dl · ♡ 11
- 🤗 google/videoprism-lvt-large-f8r288 · model · 2.5k dl · ♡ 15
- 🤗 fcxfcx/owlv2 · model · ♡ 1
Videos
No videos yet.
