On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition
Farrukh Rahman, Ömer Mubarek, Zsolt Kira

TL;DR
This paper demonstrates that vision transformers outperform CNNs in low-labeled video classification, challenging prior assumptions about their data requirements and suggesting their broader applicability in semi-supervised settings.
Contribution
The study empirically shows that video transformers are highly effective in low-labeled data regimes, outperforming CNNs and semi-supervised methods, with thorough analysis and ablation studies.
Findings
Transformers perform well in low-labeled video classification.
Transformers outperform semi-supervised CNN methods on benchmark datasets.
Analysis explains the effectiveness of transformers in low-data regimes.
Abstract
Recently vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks. The less restrictive inductive bias of transformers endows greater representational capacity in comparison with CNNs. However, in the image classification setting this flexibility comes with a trade-off with respect to sample efficiency, where transformers require ImageNet-scale training. This notion has carried over to video where transformers have not yet been explored for video classification in the low-labeled or semi-supervised settings. Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting compared to CNNs. We specifically evaluate video vision transformers across two contrasting video datasets (Kinetics-400 and…
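To make the low-labeled setting concrete, the sketch below shows one way to carve a small labeled subset out of a fully labeled video dataset and treat the remainder as unlabeled. It is a minimal illustration, not the authors' protocol: the function name `sample_labeled_subset`, the class-balanced sampling, the 10% fraction, and the toy label list are all assumptions made for this example.

```python
import random
from collections import defaultdict


def sample_labeled_subset(labels, fraction, seed=0):
    """Return sorted indices of a class-balanced labeled subset.

    labels:   list of class ids, one per video (hypothetical input format).
    fraction: share of each class to keep labeled, e.g. 0.01 for 1%.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group video indices by class.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    labeled = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep at least one labeled video per class (an assumption of this sketch).
        k = max(1, round(len(indices) * fraction))
        labeled.extend(rng.sample(indices, k))
    return sorted(labeled)


# Toy dataset: 400 videos spread evenly over 4 classes.
labels = [i % 4 for i in range(400)]
labeled = sample_labeled_subset(labels, fraction=0.10)
unlabeled = sorted(set(range(len(labels))) - set(labeled))
```

In a supervised-only comparison, a model would be trained on `labeled` alone. A semi-supervised method would also consume `unlabeled` with its labels hidden.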
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
