Adaptive Perception Transformer for Temporal Action Localization
Yizheng Ouyang, Tianjin Zhang, Weibo Gu, and Hongfa Wang

TL;DR
This paper introduces AdaPerFormer, an end-to-end adaptive perception transformer that effectively models global and local contexts for accurate temporal action localization in videos.
Contribution
The paper proposes a novel dual-branch attention mechanism within an end-to-end transformer framework for improved action boundary and category prediction.
Findings
Achieves competitive results on the THUMOS14 dataset
Effectively models global and local video contexts
Demonstrates the benefits of end-to-end design
Abstract
Temporal action localization aims to predict the boundary and category of each action instance in untrimmed long videos. Most previous methods based on anchors or proposals neglect the global-local context interaction across entire video sequences. Moreover, their multi-stage designs cannot generate action boundaries and categories directly. To address these issues, this paper proposes an end-to-end model called Adaptive Perception transformer (AdaPerFormer for short). Specifically, AdaPerFormer explores a dual-branch attention mechanism. One branch handles global perception attention, which models entire video sequences and aggregates globally relevant contexts, while the other branch concentrates on the local convolutional shift, aggregating intra-frame and inter-frame information through our bidirectional shift operation. The end-to-end nature produces the…
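The dual-branch idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a video represented as a list of per-frame feature vectors, uses plain single-head self-attention without learned projections for the global branch, uses a zero-padded channel shift (a fraction of channels moved one frame forward, the same fraction one frame backward) for the local branch, omits the convolution the abstract mentions, and fuses the two branches by simple addition. The names `global_attention`, `bidirectional_shift`, `dual_branch`, and the `fold` parameter are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(x):
    """Global branch (assumed form): every frame attends to every frame.

    x is a list of T frames, each a list of C floats. No learned
    query/key/value projections are used in this sketch.
    """
    t, c = len(x), len(x[0])
    scale = math.sqrt(c)
    out = []
    for query in x:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in x]
        weights = softmax(scores)
        out.append([sum(weights[j] * x[j][d] for j in range(t)) for d in range(c)])
    return out


def bidirectional_shift(x, fold=4):
    """Local branch (assumed form): shift channels along the time axis.

    The first C // fold channels take their value from the previous frame,
    the next C // fold channels from the next frame, and the remaining
    channels are left in place. Out-of-range positions are zero-padded.
    """
    t, c = len(x), len(x[0])
    n = c // fold
    out = [[0.0] * c for _ in range(t)]
    for i in range(t):
        for d in range(c):
            if d < n:
                src = i - 1  # value comes from the previous frame
            elif d < 2 * n:
                src = i + 1  # value comes from the next frame
            else:
                src = i      # channel is not shifted
            if 0 <= src < t:
                out[i][d] = x[src][d]
    return out


def dual_branch(x, fold=4):
    """Fuse both branches by element-wise sum (an assumption of this sketch)."""
    g = global_attention(x)
    s = bidirectional_shift(x, fold)
    return [[a + b for a, b in zip(g_row, s_row)] for g_row, s_row in zip(g, s)]


if __name__ == "__main__":
    frames = [[0.0, 1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0, 7.0],
              [8.0, 9.0, 10.0, 11.0]]
    for row in bidirectional_shift(frames):
        print(row)
```

In this toy setup the global branch lets each frame see the whole sequence, while the shift branch mixes only immediately adjacent frames, which mirrors the global versus local split the abstract describes.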
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Multimodal Machine Learning Applications
