TL;DR
This paper introduces the Single Shot Action Detector (SSAD), a network that detects action instances directly in untrimmed videos and improves accuracy over existing methods by skipping the proposal generation step.
Contribution
The paper proposes the SSAD network, which detects actions directly using 1D temporal convolutional layers, and explores network architectures and feature fusion strategies to find the best-performing configuration.
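To illustrate what a 1D temporal convolution does, here is a minimal sketch in plain Python. It is not the SSAD implementation: the function name `temporal_conv1d`, the toy features, and the kernel weights are all assumptions made for this example. It slides a kernel along the time axis of a per-snippet feature sequence and produces one response per temporal position.

```python
def temporal_conv1d(features, kernel, stride=1):
    """Slide a kernel along the time axis of a feature sequence.

    features: list of T feature vectors, each with C channels
    kernel:   list of k weight vectors, each with C channels
    Returns one scalar response per valid temporal position (no padding).
    Illustrative only; names and shapes are assumptions, not the paper's code.
    """
    k = len(kernel)
    channels = len(kernel[0])
    outputs = []
    for start in range(0, len(features) - k + 1, stride):
        total = 0.0
        for offset in range(k):
            frame = features[start + offset]
            for c in range(channels):
                total += kernel[offset][c] * frame[c]
        outputs.append(total)
    return outputs


# Toy sequence: 5 time steps, 2 channels per step (hypothetical values).
features = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 0.0]]
# Kernel spanning 3 time steps (hypothetical weights).
kernel = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

responses = temporal_conv1d(features, kernel)
strided = temporal_conv1d(features, kernel, stride=2)
print(responses)  # one response per valid position
print(strided)    # a larger stride yields a shorter output sequence
```

A stride greater than 1 shortens the output sequence, so each output position summarizes a longer stretch of the video. A real network would stack many such filters with learned weights and nonlinearities.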
Findings
SSAD outperforms state-of-the-art methods on THUMOS 2014 and MEXaction2 datasets.
It achieves significant mAP improvements at an IoU threshold of 0.5.
Extensive experiments validate the effectiveness of the proposed approach.
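For readers unfamiliar with the metric, the IoU threshold mentioned above refers to the temporal overlap between a predicted action segment and a ground-truth segment. The sketch below shows how that overlap is commonly computed. The function names and example segments are assumptions for illustration, not taken from the paper's evaluation code.

```python
def temporal_iou(pred, truth):
    """Intersection over union of two (start, end) time segments, in seconds."""
    pred_start, pred_end = pred
    truth_start, truth_end = truth
    intersection = max(0.0, min(pred_end, truth_end) - max(pred_start, truth_start))
    union = (pred_end - pred_start) + (truth_end - truth_start) - intersection
    return intersection / union if union > 0 else 0.0


def overlaps_enough(pred, truth, threshold=0.5):
    """True if the predicted segment meets the IoU threshold.

    A full evaluation would also require the predicted class to match
    and would match each ground-truth instance at most once.
    """
    return temporal_iou(pred, truth) >= threshold


# Hypothetical segments: prediction 10s-20s, ground truth 12s-22s.
iou = temporal_iou((10.0, 20.0), (12.0, 22.0))
print(round(iou, 3))                                   # 8s overlap / 12s union
print(overlaps_enough((10.0, 20.0), (12.0, 22.0)))     # meets the 0.5 threshold
print(overlaps_enough((10.0, 20.0), (18.0, 30.0)))     # only 2s overlap / 20s union
```

mAP then averages, over action classes, the average precision obtained when detections are counted as correct under this threshold.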
Abstract
Temporal action detection is a very important yet challenging problem, since videos in real applications are usually long, untrimmed and contain multiple action instances. This problem requires not only recognizing action categories but also detecting start time and end time of each action instance. Many state-of-the-art methods adopt the "detection by classification" framework: first do proposal, and then classify proposals. The main drawback of this framework is that the boundaries of action instance proposals have been fixed during the classification step. To address this issue, we propose a novel Single Shot Action Detector (SSAD) network based on 1D temporal convolutional layers to skip the proposal generation step via directly detecting action instances in untrimmed video. On pursuit of designing a particular SSAD network that can work effectively for temporal action detection, we…
