On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition

Farrukh Rahman; Ömer Mubarek; Zsolt Kira

arXiv:2209.07474·cs.CV·October 27, 2022·1 cites

Open Access

TL;DR

This paper demonstrates that vision transformers outperform CNNs in low-labeled video classification, challenging prior assumptions about their data requirements and suggesting their broader applicability in semi-supervised settings.

Contribution

The study empirically shows that video transformers are highly effective in low-labeled data regimes, outperforming CNNs and semi-supervised methods, with thorough analysis and ablation studies.

Findings

1. Transformers perform well in low-labeled video classification.

2. Transformers outperform semi-supervised CNN methods on benchmark datasets.

3. Analysis explains the effectiveness of transformers in low-data regimes.

Abstract

Recently vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks. The less restrictive inductive bias of transformers endows greater representational capacity in comparison with CNNs. However, in the image classification setting this flexibility comes with a trade-off with respect to sample efficiency, where transformers require ImageNet-scale training. This notion has carried over to video where transformers have not yet been explored for video classification in the low-labeled or semi-supervised settings. Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting compared to CNNs. We specifically evaluate video vision transformers across two contrasting video datasets (Kinetics-400 and…

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition