Dual DETRs for Multi-Label Temporal Action Detection
Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, Limin Wang

TL;DR
This paper introduces DualDETR, a dual-level query-based framework for multi-label temporal action detection that improves boundary localization by capturing both instance and boundary semantics through a two-branch decoding structure.
Contribution
The paper proposes a dual-level query framework with a joint initialization strategy that improves temporal boundary localization in multi-label TAD over existing methods.
Findings
Achieves superior performance on three multi-label TAD benchmarks.
Significantly improves det-mAP over state-of-the-art methods.
Demonstrates effective boundary localization through dual-level decoding.
Abstract
Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos. Inspired by the success of DETR in object detection, several methods have adapted the query-based framework to the TAD task. However, these approaches primarily followed DETR to predict actions at the instance level (i.e., identify each action by its center point), leading to sub-optimal boundary localization. To address this issue, we propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions from both instance-level and boundary-level. Decoding at different levels requires semantics of different granularity, therefore we introduce a two-branch decoding structure. This structure builds distinctive decoding processes for different levels, facilitating explicit capture of temporal cues and semantics at each level. On top of the…
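To make the distinction between the two levels concrete, the following minimal sketch contrasts an instance-level description of an action (center and duration) with a boundary-level one (start and end). It is an illustration only, not the DualDETR decoder: the class names `InstanceSegment` and `BoundarySegment`, the helper functions, and the use of seconds as the time unit are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSegment:
    """Hypothetical instance-level view: center and duration, in seconds."""
    center: float
    duration: float


@dataclass(frozen=True)
class BoundarySegment:
    """Hypothetical boundary-level view: start and end times, in seconds."""
    start: float
    end: float


def to_boundaries(seg: InstanceSegment) -> BoundarySegment:
    # Both boundaries are derived from the same center, so an error in the
    # center estimate moves the start and the end by the same amount.
    half = seg.duration / 2.0
    return BoundarySegment(start=seg.center - half, end=seg.center + half)


def to_instance(seg: BoundarySegment) -> InstanceSegment:
    # Inverse mapping: recover center and duration from the two boundaries.
    return InstanceSegment(
        center=(seg.start + seg.end) / 2.0,
        duration=seg.end - seg.start,
    )


def temporal_iou(a: BoundarySegment, b: BoundarySegment) -> float:
    # Standard 1-D intersection over union between two temporal segments.
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    truth = to_boundaries(InstanceSegment(center=5.0, duration=4.0))
    shifted = to_boundaries(InstanceSegment(center=7.0, duration=4.0))
    print(truth, shifted, temporal_iou(truth, shifted))
```

In this toy setting, shifting only the center by two seconds displaces both boundaries at once and lowers the temporal IoU, which is one way to read the abstract's remark that purely instance-level prediction can leave boundary localization sub-optimal.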
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Analysis and Summarization
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Adam · Byte Pair Encoding · Feedforward Network · Absolute Position Encodings · Softmax · Convolution
