Video Swin Transformer

Ze Liu; Jia Ning; Yue Cao; Yixuan Wei; Zheng Zhang; Stephen Lin; Han Hu

arXiv:2106.13230 · cs.CV · June 25, 2021 · 47 cites

Open Access 5 Repos 4 Models

TL;DR

This paper introduces Video Swin Transformer, which adapts the image-domain Swin Transformer to video and computes self-attention within local windows rather than globally, yielding a better speed-accuracy trade-off on video recognition tasks.
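To make "local attention" concrete, the following is a minimal sketch of grouping the tokens of a video into non-overlapping 3D windows, so that attention would be computed only among tokens sharing a window. The function name `partition_into_windows`, the toy grid size, and the window size are hypothetical choices for illustration and do not come from the paper's code. The sketch also omits the shifted-window step that Swin Transformer uses to connect neighboring windows.

```python
from itertools import product


def partition_into_windows(grid, window):
    """Group token coordinates of a (T, H, W) grid into non-overlapping 3D windows.

    Hypothetical helper for illustration only. Returns a dict that maps a
    window index (wt_idx, wh_idx, ww_idx) to the list of token coordinates
    (t, h, w) that fall inside that window.
    """
    (t, h, w), (wt, wh, ww) = grid, window
    if t % wt or h % wh or w % ww:
        # Simplifying assumption: no padding, so sizes must divide evenly.
        raise ValueError("grid must be divisible by window in this sketch")

    windows = {}
    for ti, hi, wi in product(range(t), range(h), range(w)):
        key = (ti // wt, hi // wh, wi // ww)
        windows.setdefault(key, []).append((ti, hi, wi))
    return windows


# Toy example (assumed sizes): a 4x4x4 token grid split into 2x2x2 windows.
windows = partition_into_windows(grid=(4, 4, 4), window=(2, 2, 2))
print(len(windows), "windows of", len(windows[(0, 0, 0)]), "tokens each")
```

Under this grouping, token (0, 0, 0) would attend to token (1, 1, 1) because both fall in window (0, 0, 0), but not to token (2, 0, 0), which belongs to a different window.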

Contribution

The paper proposes a Video Swin Transformer with an inductive bias of locality, reaching state-of-the-art accuracy more efficiently than video Transformers that compute self-attention globally.

Findings

01 Achieves 84.9% top-1 accuracy on Kinetics-400

02 Uses ~20x less pre-training data and a ~3x smaller model than prior state-of-the-art models

03 Attains 69.6% top-1 accuracy on Something-Something v2

Abstract

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1…
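The speed argument in the abstract can be illustrated with a rough count of query-key pairs. This is a back-of-the-envelope sketch, not the paper's complexity analysis: it ignores attention heads, channel dimensions, and projection layers. The token grid and window sizes below are illustrative assumptions rather than values stated in the text above, and both function names are hypothetical.

```python
def global_attention_pairs(grid):
    """Query-key pairs when every token attends to every other token."""
    t, h, w = grid
    n = t * h * w
    return n * n


def windowed_attention_pairs(grid, window):
    """Query-key pairs when each token attends only within its own 3D window."""
    t, h, w = grid
    wt, wh, ww = window
    if t % wt or h % wh or w % ww:
        # Simplifying assumption: no padding, so sizes must divide evenly.
        raise ValueError("grid must be divisible by window in this sketch")
    n = t * h * w
    tokens_per_window = wt * wh * ww
    # Each of the n tokens interacts with tokens_per_window keys.
    return n * tokens_per_window


# Assumed sizes for illustration: 16x56x56 tokens, 8x7x7 windows.
GRID = (16, 56, 56)
WINDOW = (8, 7, 7)

global_pairs = global_attention_pairs(GRID)
local_pairs = windowed_attention_pairs(GRID, WINDOW)
ratio = global_pairs // local_pairs

print(f"global: {global_pairs:,} pairs")
print(f"local:  {local_pairs:,} pairs")
print(f"global / local = {ratio}")
```

Global attention grows quadratically with the number of tokens, while windowed attention grows linearly for a fixed window size, so the gap widens as clips get longer or resolution increases.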

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Anomaly Detection Techniques and Applications

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Stochastic Depth · Swin Transformer · Byte Pair Encoding · Adam · Layer Normalization · Dropout