Towards High-Quality Temporal Action Detection with Sparse Proposals
Jiannan Wu, Peize Sun, Shoufa Chen, Jiewen Yang, Zihao Qi, Lan Ma, Ping Luo

TL;DR
This paper introduces SP-TAD, a sparse proposal-based Transformer method for temporal action detection that effectively handles variable action durations and ambiguous boundaries, achieving state-of-the-art results on THUMOS14.
Contribution
The paper proposes Sparse Proposals within a Transformer framework to improve multi-scale feature utilization and boundary precision in temporal action detection.
Findings
Achieves state-of-the-art performance on THUMOS14 at high tIoU thresholds.
Effectively handles large variance in action durations.
Utilizes local segment interactions to preserve action details.
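The "high tIoU thresholds" in the findings refer to temporal Intersection over Union, the standard overlap measure between a predicted segment and a ground-truth segment. The following is a minimal sketch of that measure, not code from the paper; the function names `temporal_iou` and `is_true_positive` and the example segments are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold):
    """A prediction matches a ground-truth segment if tIoU >= threshold."""
    return temporal_iou(pred, gt) >= threshold


# Hypothetical segments: the same 0.5 s boundary shift on a short and a long action.
short_gt, short_pred = (10.0, 12.0), (10.5, 12.5)
long_gt, long_pred = (10.0, 30.0), (10.5, 30.5)

print(temporal_iou(short_pred, short_gt))  # short action: overlap drops sharply
print(temporal_iou(long_pred, long_gt))    # long action: overlap stays high
```

With these made-up numbers, the short prediction passes a 0.5 threshold but fails at 0.7, while the long one passes both. This is one way to see why boundary precision on short actions matters most at high tIoU thresholds.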
Abstract
Temporal Action Detection (TAD) is an essential and challenging topic in video understanding, aiming to localize the temporal segments containing human action instances and predict the action categories. Previous works rely heavily on dense candidates, either by designing varying anchors or by enumerating all combinations of boundaries on video sequences; they therefore involve complicated pipelines and sensitive hand-crafted designs. Recently, with the resurgence of the Transformer, query-based methods have emerged as a promising alternative thanks to their simplicity and flexibility. However, there still exists a performance gap between query-based methods and well-established methods. In this paper, we identify that the main challenge lies in the large variance of action durations and the ambiguous boundaries of short action instances; nevertheless, quadratic-computational global…
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Dense Connections · Label Smoothing · Multi-Head Attention · Byte Pair Encoding · Softmax
