Temporal Action Localization with Enhanced Instant Discriminability
Dingfeng Shi, Qiong Cao, Yujie Zhong, Shan An, Jian Cheng, Haogang Zhu, Dacheng Tao

TL;DR
This paper introduces TriDet, a one-stage framework for temporal action detection that enhances boundary modeling and instant discriminability using a novel Trident-head, a scalable-granularity perception layer, and large pretrained models, achieving state-of-the-art results.
Contribution
The paper proposes a novel one-stage TAD framework with a Trident-head, an SGP layer, and the integration of large pretrained models to improve boundary detection and discriminability.
Findings
TriDet achieves state-of-the-art performance on multiple TAD datasets.
The SGP layer effectively mitigates rank-loss in transformer-based methods.
Large pretrained models enhance the representation capability for TAD.
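To make the rank-loss finding more concrete, here is a minimal sketch of what "instants becoming less distinguishable" can look like. It is an illustration only, not the paper's analysis: the helper names (mean_pairwise_cosine, mix), the toy features, and the choice of uniform mixing weights as an extreme case of attention-like averaging are all assumptions made for this example.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(features):
    """Average cosine similarity over all pairs of temporal instants.
    Higher values mean the instants are harder to tell apart."""
    pairs = list(combinations(features, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def mix(weights, features):
    """Hypothetical attention-like mixing: each output instant is a
    weighted sum of all input instants (each row of weights sums to 1)."""
    dim = len(features[0])
    return [
        [sum(w * f[d] for w, f in zip(row, features)) for d in range(dim)]
        for row in weights
    ]

# Toy input: three instants with two-dimensional features (assumed values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Extreme case: every instant attends uniformly to every other instant.
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
mixed = mix(uniform, features)

before = mean_pairwise_cosine(features)
after = mean_pairwise_cosine(mixed)
print(f"similarity before mixing: {before:.3f}")
print(f"similarity after mixing:  {after:.3f}")
```

With uniform weights every output instant collapses to the same average vector, so the pairwise similarity rises to 1. Real attention weights are not uniform, so the effect in practice would be weaker than in this toy case.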
Abstract
Temporal action detection (TAD) aims to detect all action boundaries and their corresponding categories in an untrimmed video. The unclear boundaries of actions in videos often result in imprecise predictions of action boundaries by existing methods. To resolve this issue, we propose a one-stage framework named TriDet. First, we propose a Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. Then, we analyze the rank-loss problem (i.e. instant discriminability deterioration) in transformer-based methods and propose an efficient scalable-granularity perception (SGP) layer to mitigate this issue. To further push the limit of instant discriminability in the video backbone, we leverage the strong representation capability of pretrained large models and investigate their performance on TAD. Last, considering the adequate…
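The idea of modeling a boundary through a probability distribution rather than a single regressed value can be sketched as follows. The abstract only states that the Trident-head uses an estimated relative probability distribution around the boundary; turning that distribution into a location by taking an expectation over discrete bins, the function names, and the bin layout below are assumptions for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_offset(bin_logits):
    """Assumed decoding step: treat the scores of bins 0..B as a
    relative probability distribution and return the expected bin
    index as the boundary offset (in units of bins)."""
    probs = softmax(bin_logits)
    return sum(index * p for index, p in enumerate(probs))

# Hypothetical scores for five bins next to an instant.
flat = [0.0, 0.0, 0.0, 0.0, 0.0]      # no preference among bins
peaked = [-5.0, -5.0, 4.0, -5.0, -5.0]  # strong evidence at bin 2

print(expected_offset(flat))
print(expected_offset(peaked))
```

One reason such a design may help with unclear boundaries is that an ambiguous boundary produces a spread-out distribution whose expectation still gives a usable estimate, whereas a direct regression would have to commit to a single value.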
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Multimodal Machine Learning Applications
