DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition
Yuxuan Liang, Pan Zhou, Roger Zimmermann, Shuicheng Yan

TL;DR
DualFormer is a novel transformer architecture that efficiently captures local and global spatiotemporal dependencies in video recognition, significantly reducing computational costs while maintaining high accuracy.
Contribution
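It stratifies full space-time attention into two cascaded levels: local attention among nearby 3D tokens, followed by global attention between each query token and coarse-grained pyramid contexts. This avoids the cost of full attention over all 3D tokens while still capturing both local and global dependencies.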
Findings
Achieves 82.9% top-1 accuracy on Kinetics-400 with ~1000G FLOPs
Uses at least 3.2x fewer FLOPs than existing methods with similar performance
Verifies superior performance on five video benchmarks
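To give a feel for where such savings can come from, the sketch below counts query-key pairs for full space-time attention versus a local-window level plus a pooled global level. It is a back-of-the-envelope illustration only: the token grid, window size, and pooling factors are hypothetical values chosen for the example, not the paper's configuration, and pair counts are not the same thing as the reported FLOPs.

```python
def full_attention_pairs(t, h, w):
    """Query-key pairs when every 3D token attends to every other token."""
    n = t * h * w
    return n * n


def stratified_pairs(t, h, w, window, pools):
    """Query-key pairs for local-window attention plus global-context attention.

    window: (wt, wh, ww) size of each non-overlapping local window (hypothetical).
    pools:  list of (pt, ph, pw) pooling factors, one per pyramid level (hypothetical).
    """
    n = t * h * w
    wt, wh, ww = window
    # Local level: each token attends only to the tokens in its own window.
    local = n * (wt * wh * ww)
    # Global level: each token attends to the pooled context tokens of all levels.
    n_ctx = sum((t // pt) * (h // ph) * (w // pw) for pt, ph, pw in pools)
    return local + n * n_ctx


# Hypothetical token grid and hyperparameters, for illustration only.
t, h, w = 16, 56, 56
window = (8, 7, 7)
pools = [(2, 8, 8), (4, 14, 14)]

full = full_attention_pairs(t, h, w)
strat = stratified_pairs(t, h, w, window, pools)
print(f"full: {full}, stratified: {strat}, ratio: {full / strat:.1f}x")
```

Under these assumptions the stratified count grows linearly with the number of tokens for a fixed window and context size, whereas full attention grows quadratically.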
Abstract
While transformers have shown great potential in video recognition thanks to their strong capability of capturing long-range dependencies, they often suffer from high computational costs induced by self-attention over the huge number of 3D tokens. In this paper, we present a new transformer architecture termed DualFormer, which can efficiently perform space-time attention for video recognition. Concretely, DualFormer stratifies the full space-time attention into dual cascaded levels, i.e., it first learns fine-grained local interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between the query token and global pyramid contexts. Unlike existing methods that apply space-time factorization or restrict attention computations to local windows to improve efficiency, our local-global stratification strategy can capture both short- and…
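The two cascaded levels described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it omits learned query/key/value projections, multi-head splitting, normalization, and MLP layers, and it assumes the global pyramid contexts are built by average pooling at several scales. The function names, window size, and pooling factors are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention; q: (..., Nq, C), k and v: (..., Nk, C)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def local_window_attention(x, window):
    """Level 1: attention among tokens inside each non-overlapping 3D window.

    x: (T, H, W, C) token grid; window: (wt, wh, ww) must divide (T, H, W).
    """
    T, H, W, C = x.shape
    wt, wh, ww = window
    x = x.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
    win = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, wt * wh * ww, C)
    out = attention(win, win, win)  # batched over windows
    out = out.reshape(T // wt, H // wh, W // ww, wt, wh, ww, C)
    return out.transpose(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, C)


def avg_pool3d(x, pool):
    """Average-pool the token grid by factors (pt, ph, pw); returns (N_ctx, C)."""
    T, H, W, C = x.shape
    pt, ph, pw = pool
    x = x.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    return x.mean(axis=(1, 3, 5)).reshape(-1, C)


def global_pyramid_attention(x, pools):
    """Level 2: every token queries pooled contexts from all pyramid levels.

    Average pooling is an assumption made for this sketch.
    """
    T, H, W, C = x.shape
    contexts = np.concatenate([avg_pool3d(x, p) for p in pools], axis=0)
    q = x.reshape(-1, C)
    return attention(q, contexts, contexts).reshape(T, H, W, C)


def local_global_block(x, window, pools):
    """Cascade the two levels with residual connections (simplified)."""
    x = x + local_window_attention(x, window)
    x = x + global_pyramid_attention(x, pools)
    return x


# Hypothetical sizes: 4 x 8 x 8 tokens with 16 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8, 16))
y = local_global_block(x, window=(2, 4, 4), pools=[(2, 4, 4), (4, 8, 8)])
print(y.shape)  # same shape as the input grid
```

In this sketch the local level never mixes information across windows, so long-range interaction comes only from the second level, where each query sees a small set of pooled context tokens rather than every token in the video.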
Taxonomy
Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
