DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition

Yuxuan Liang; Pan Zhou; Roger Zimmermann; Shuicheng Yan

arXiv:2112.04674 · cs.CV · November 23, 2022 · 1 citation

TL;DR

DualFormer is a novel transformer architecture that efficiently captures local and global spatiotemporal dependencies in video recognition, significantly reducing computational costs while maintaining high accuracy.

Contribution

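DualFormer stratifies full space-time attention into two cascaded levels: fine-grained local attention among nearby 3D tokens, followed by coarse-grained global attention between each query token and global pyramid contexts. This stratification aims to lower the cost of attention while still covering both local and global dependencies.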

Findings

01 Achieves 82.9% top-1 accuracy on Kinetics-400 with ~1000G FLOPs

02 Uses at least 3.2x fewer FLOPs than existing methods of similar accuracy

03 Verifies superior performance on five video benchmarks

Abstract

While transformers have shown great potential for video recognition thanks to their strong capability of capturing long-range dependencies, they often suffer from high computational costs induced by self-attention over the huge number of 3D tokens. In this paper, we present a new transformer architecture termed DualFormer, which can efficiently perform space-time attention for video recognition. Concretely, DualFormer stratifies the full space-time attention into dual cascaded levels, i.e., it first learns fine-grained local interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between the query token and global pyramid contexts. Unlike existing methods that apply space-time factorization or restrict attention computations to local windows to improve efficiency, our local-global stratification strategy can capture both short- and…
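The two cascaded levels described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the window size, the pooling sizes, the use of average pooling to build the pyramid contexts, the residual connections, and the omission of learned query/key/value projections are all assumptions made to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention; q: (..., Nq, C), k and v: (..., Nk, C).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def local_window_attention(x, window):
    # Level 1 (assumed form): self-attention inside non-overlapping 3D windows.
    T, H, W, C = x.shape
    wt, wh, ww = window
    xw = x.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
    xw = xw.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, wt * wh * ww, C)
    out = attend(xw, xw, xw)  # identity projections, an assumption
    out = out.reshape(T // wt, H // wh, W // ww, wt, wh, ww, C)
    return out.transpose(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, C)

def avg_pool3d(x, size):
    # Average-pool the token grid into a flat list of coarse context tokens.
    T, H, W, C = x.shape
    st, sh, sw = size
    x = x.reshape(T // st, st, H // sh, sh, W // sw, sw, C)
    return x.mean(axis=(1, 3, 5)).reshape(-1, C)

def global_pyramid_attention(x, pool_sizes):
    # Level 2 (assumed form): every token queries pooled contexts at several scales.
    T, H, W, C = x.shape
    context = np.concatenate([avg_pool3d(x, s) for s in pool_sizes], axis=0)
    q = x.reshape(-1, C)
    return attend(q, context, context).reshape(T, H, W, C)

def dual_level_block(x, window=(2, 2, 2), pool_sizes=((2, 2, 2), (4, 4, 4))):
    # Cascade: local attention first, then global attention (residuals assumed).
    x = x + local_window_attention(x, window)
    x = x + global_pyramid_attention(x, pool_sizes)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((4, 8, 8, 16))  # (T, H, W, C) token grid
    print(dual_level_block(tokens).shape)  # (4, 8, 8, 16)
```

With these hypothetical sizes, the grid holds 4 x 8 x 8 = 256 tokens. Each query compares against 8 keys in the local stage and 32 + 4 = 36 pooled contexts in the global stage, instead of all 256 tokens under full space-time attention.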

Code & Models

Repositories

sail-sg/dualformer
PyTorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis