Video Swin Transformer
Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, Han Hu

TL;DR
This paper introduces Video Swin Transformer, which adapts the image-domain Swin Transformer to video. By computing self-attention locally rather than globally, it reaches a better speed-accuracy trade-off on video recognition tasks.
Contribution
The paper proposes a novel Video Swin Transformer that incorporates locality bias, achieving state-of-the-art results with better efficiency compared to global attention models.
Findings
Achieves 84.9% top-1 accuracy on Kinetics-400
Uses ~20x less pre-training data and a ~3x smaller model than prior top-performing approaches
Attains 69.6% top-1 accuracy on Something-Something v2
Abstract
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1…
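To make the locality argument concrete, here is a minimal sketch of restricting attention to non-overlapping 3D windows of a video token grid. It is an illustration under stated assumptions, not the authors' implementation: the function names `window_partition_3d` and `attention_pairs`, the grid size, and the window size are all hypothetical, and the shifted-window step that Swin uses to connect neighbouring windows is omitted.

```python
import numpy as np


def window_partition_3d(x, window):
    """Split a (T, H, W, C) token grid into non-overlapping 3D windows.

    Returns an array of shape (num_windows, wt * wh * ww, C).
    Assumes T, H and W are divisible by the window size (no padding).
    """
    T, H, W, C = x.shape
    wt, wh, ww = window
    assert T % wt == 0 and H % wh == 0 and W % ww == 0
    x = x.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
    # Move the three window-index axes to the front, then flatten each window.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, wt * wh * ww, C)


def attention_pairs(num_tokens, tokens_per_group):
    """Count query-key pairs when attention is restricted to equal-size groups."""
    return (num_tokens // tokens_per_group) * tokens_per_group ** 2


# Illustrative sizes only (hypothetical, not taken from the paper).
T, H, W, C = 8, 14, 14, 4
window = (2, 7, 7)
tokens = np.arange(T * H * W * C, dtype=np.float32).reshape(T, H, W, C)
windows = window_partition_3d(tokens, window)

n = T * H * W
global_pairs = attention_pairs(n, n)                 # every token attends to every token
local_pairs = attention_pairs(n, windows.shape[1])   # attention stays inside each window
print(windows.shape, global_pairs // local_pairs)
```

With these toy sizes the grid splits into 16 windows of 98 tokens each, so windowed attention evaluates 16x fewer query-key pairs than global attention. The saving grows with the size of the token grid, which is the intuition behind the speed-accuracy trade-off claimed in the abstract.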
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Anomaly Detection Techniques and Applications
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Stochastic Depth · Swin Transformer · Byte Pair Encoding · Adam · Layer Normalization · Dropout
