One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features
Trung Thanh Nguyen, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide

TL;DR
This paper introduces a novel one-stage open-vocabulary temporal action detection method that uses multi-scale analysis and video-text alignment to improve detection accuracy across diverse action durations and categories.
Contribution
The paper proposes a one-stage approach with multi-scale video analysis and video-text alignment modules, advancing open-vocabulary TAD by addressing duration variability and label alignment challenges.
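As a rough illustration of the two ideas named above, the following minimal sketch builds a temporal feature pyramid by stride-2 average pooling and scores every time step against a set of label embeddings by cosine similarity. It is not the authors' architecture: the function names (`temporal_pyramid`, `label_scores`), the pooling scheme, the temperature value, and the random placeholder features and label embeddings are all assumptions made for this example. A real system would take its video features and label embeddings from pretrained video and text encoders.

```python
import numpy as np

def temporal_pyramid(features, num_levels=3):
    """Return feature maps at successively halved temporal resolution.

    features: array of shape (T, D), one D-dim vector per video snippet.
    Coarser levels cover longer time spans per step, which is one common
    way to handle actions of different durations (an assumption here).
    """
    levels = [features]
    for _ in range(num_levels - 1):
        prev = levels[-1]
        t = prev.shape[0] - prev.shape[0] % 2  # drop a trailing odd step
        if t < 2:
            break
        # Average each pair of adjacent time steps (stride-2 pooling).
        levels.append(prev[:t].reshape(t // 2, 2, -1).mean(axis=1))
    return levels

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def label_scores(level, label_embeddings, temperature=0.07):
    """Softmax over cosine similarities between time steps and labels.

    Because labels enter only as embeddings, new label names can be added
    at inference time without retraining a fixed classifier head.
    """
    sims = l2_normalize(level) @ l2_normalize(label_embeddings).T
    logits = sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Placeholder data: random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
video_features = rng.normal(size=(64, 16))           # 64 snippets, 16 dims
label_names = ["high jump", "pole vault", "long jump"]
label_embeddings = rng.normal(size=(len(label_names), 16))

pyramid = temporal_pyramid(video_features, num_levels=4)
scores = [label_scores(level, label_embeddings) for level in pyramid]
for level, s in zip(pyramid, scores):
    print(level.shape, s.shape)
```

Each pyramid level yields a per-time-step distribution over the supplied labels. A one-stage detector would predict action boundaries from the same features in the same pass, rather than classifying proposals generated by a separate first stage.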
Findings
The method achieved superior results on the THUMOS14 and ActivityNet-1.3 datasets.
It outperformed existing methods in both open-vocabulary and closed-vocabulary settings.
It proved effective at detecting actions of varying durations.
Abstract
Open-vocabulary Temporal Action Detection (Open-vocab TAD) is an advanced video analysis approach that expands Closed-vocabulary Temporal Action Detection (Closed-vocab TAD) capabilities. Closed-vocab TAD is typically confined to localizing and classifying actions based on a predefined set of categories. In contrast, Open-vocab TAD goes further and is not limited to these predefined categories. This is particularly useful in real-world scenarios where the variety of actions in videos can be vast and not always predictable. The prevalent methods in Open-vocab TAD typically employ a 2-stage approach, which involves generating action proposals and then identifying those actions. However, errors made during the first stage can adversely affect the subsequent action identification accuracy. Additionally, existing studies face challenges in handling actions of different durations owing to the…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
Methods: Sparse Evolutionary Training · ALIGN
