An Efficient Spatio-Temporal Pyramid Transformer for Action Detection
Yuetian Weng, Zizheng Pan, Mingfei Han, Xiaojun Chang, Bohan Zhuang

TL;DR
This paper introduces an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) that balances accuracy and computational efficiency for action detection in videos by combining local and global attention mechanisms.
Contribution
The paper proposes a novel hierarchical Transformer architecture that uses local window attention in early layers and global attention later, improving efficiency without sacrificing accuracy.
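The efficiency argument can be illustrated by counting the query-key pairs that one self-attention layer scores. The sketch below is a back-of-the-envelope illustration, not the paper's FLOP accounting: the function name `attention_pairs`, the 8x14x14 token grid, and the 2x7x7 window are hypothetical values chosen for the example, and non-overlapping windows are assumed.

```python
def attention_pairs(t, h, w, window=None):
    """Count query-key pairs scored by one self-attention layer.

    t, h, w : size of the token grid (time, height, width).
    window  : optional (wt, wh, ww) non-overlapping window; None means global.
    """
    n = t * h * w  # total number of tokens
    if window is None:
        # Global attention: every token attends to every token.
        return n * n
    wt, wh, ww = window
    if t % wt or h % wh or w % ww:
        raise ValueError("window must evenly divide the token grid")
    # Local attention: every token attends only to the tokens in its window.
    return n * (wt * wh * ww)


# Hypothetical grid and window sizes, chosen only for illustration.
grid = (8, 14, 14)
global_pairs = attention_pairs(*grid)
local_pairs = attention_pairs(*grid, window=(2, 7, 7))
print(global_pairs, local_pairs, global_pairs // local_pairs)
```

With these illustrative sizes, window attention scores 16 times fewer pairs than global attention, and the gap widens as the token grid grows while the window stays fixed.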
Findings
Achieves 53.6% mAP on THUMOS14 with only RGB input.
Outperforms the I3D+AFSD RGB model by over 10% mAP.
Uses 31% fewer GFLOPs than the state-of-the-art AFSD, which additionally relies on optical-flow features.
Abstract
The task of action detection aims at deducing both the action category and localization of the start and end moment for each action instance in a long, untrimmed video. While vision Transformers have driven the recent advances in video understanding, it is non-trivial to design an efficient architecture for action detection due to the prohibitively expensive self-attentions over a long sequence of video clips. To this end, we present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) for action detection, building upon the fact that the early self-attention layers in Transformers still focus on local patterns. Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages while applying global attention modules to capture long-term space-time dependencies in the later stages. In this way, our STPT can…
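The two attention patterns named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions, not the authors' implementation: it uses a single head with no learned query, key, or value projections, non-overlapping windows, and no pyramid downsampling between stages. The function names (`attend`, `global_attention`, `local_window_attention`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(tokens):
    """Single-head self-attention over tokens of shape (..., N, C).

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projections), to keep the sketch short.
    """
    scale = tokens.shape[-1] ** -0.5
    scores = tokens @ np.swapaxes(tokens, -1, -2) * scale  # (..., N, N)
    return softmax(scores, axis=-1) @ tokens               # (..., N, C)


def global_attention(x):
    """Every token attends to every token in the (T, H, W, C) feature map."""
    t, h, w, c = x.shape
    return attend(x.reshape(t * h * w, c)).reshape(t, h, w, c)


def local_window_attention(x, window):
    """Each token attends only within its non-overlapping (wt, wh, ww) window."""
    t, h, w, c = x.shape
    wt, wh, ww = window
    nt, nh, nw = t // wt, h // wh, w // ww
    # Split every axis into (number of windows, window size).
    x = x.reshape(nt, wt, nh, wh, nw, ww, c)
    # Group the window indices first, then the positions inside a window.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    windows = x.reshape(nt * nh * nw, wt * wh * ww, c)
    out = attend(windows)  # attention runs independently per window
    # Undo the grouping to restore the (T, H, W, C) layout.
    out = out.reshape(nt, nh, nw, wt, wh, ww, c).transpose(0, 3, 1, 4, 2, 5, 6)
    return out.reshape(t, h, w, c)


# Illustrative feature map: 4 frames of 4x4 tokens with 8 channels.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 4, 8))
early = local_window_attention(features, window=(2, 2, 2))  # early-stage style
late = global_attention(early)                              # later-stage style
print(early.shape, late.shape)
```

When the window covers the whole feature map, the local variant reduces to global attention, which is a convenient sanity check for the reshaping logic.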
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Anomaly Detection Techniques and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Residual Connection
