An Efficient Spatio-Temporal Pyramid Transformer for Action Detection

Yuetian Weng; Zizheng Pan; Mingfei Han; Xiaojun Chang; Bohan Zhuang

arXiv:2207.10448 · cs.CV · July 22, 2022 · 1 citation

TL;DR

This paper introduces an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) that balances accuracy and computational efficiency for action detection in videos by combining local and global attention mechanisms.

Contribution

The paper proposes a novel hierarchical Transformer architecture that uses local window attention in early layers and global attention later, improving efficiency without sacrificing accuracy.

Findings

01. Achieves 53.6% mAP on THUMOS14 with only RGB input.

02. Outperforms the I3D+AFSD RGB model by over 10%.

03. Uses 31% fewer GFLOPs than the state-of-the-art AFSD.

Abstract

The task of action detection aims at deducing both the action category and localization of the start and end moment for each action instance in a long, untrimmed video. While vision Transformers have driven the recent advances in video understanding, it is non-trivial to design an efficient architecture for action detection due to the prohibitively expensive self-attentions over a long sequence of video clips. To this end, we present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) for action detection, building upon the fact that the early self-attention layers in Transformers still focus on local patterns. Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages while applying global attention modules to capture long-term space-time dependencies in the later stages. In this way, our STPT can…
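To make the efficiency argument concrete, the sketch below counts query-key pairs for global attention versus attention restricted to non-overlapping spatio-temporal windows. It is a back-of-the-envelope illustration only: the function names, the token grid (16 x 24 x 24), and the window size (4 x 6 x 6) are hypothetical values chosen for the example, not figures from the paper, and the count ignores heads, channel width, and any downsampling between stages.

```python
def global_attention_pairs(t, h, w):
    """Query-key pairs when every token attends to every other token."""
    n = t * h * w
    return n * n


def window_attention_pairs(t, h, w, wt, wh, ww):
    """Query-key pairs when attention stays inside non-overlapping windows."""
    if t % wt or h % wh or w % ww:
        raise ValueError("window size must divide the token grid evenly")
    num_windows = (t // wt) * (h // wh) * (w // ww)
    tokens_per_window = wt * wh * ww
    return num_windows * tokens_per_window ** 2


# Hypothetical early-stage token grid: 16 frames x 24 x 24 patches.
T, H, W = 16, 24, 24
# Hypothetical local window: 4 frames x 6 x 6 patches.
full = global_attention_pairs(T, H, W)
local = window_attention_pairs(T, H, W, 4, 6, 6)

print(f"global attention: {full:,} pairs")
print(f"window attention: {local:,} pairs ({full // local}x fewer)")
```

Under these assumptions the saving equals the number of windows, since the global cost grows with the square of the token count while the windowed cost grows linearly in it for a fixed window size. This is consistent with keeping attention local in the early, high-resolution stages and reserving global attention for later stages, where fewer tokens remain.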

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Anomaly Detection Techniques and Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Residual Connection