One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features

Trung Thanh Nguyen; Yasutomo Kawanishi; Takahiro Komamizu; Ichiro Ide

arXiv:2404.19542 · cs.CV · May 1, 2024

Open Access · 1 Repo

TL;DR

This paper introduces a novel one-stage open-vocabulary temporal action detection method that uses multi-scale analysis and video-text alignment to improve detection accuracy across diverse action durations and categories.

Contribution

The paper proposes a one-stage approach with multi-scale video analysis and video-text alignment modules, advancing open-vocabulary TAD by addressing duration variability and label alignment challenges.
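The two ideas named above, multi-scale temporal analysis and video-text alignment, can be illustrated with a toy sketch. This is not the authors' implementation: the helper names (`temporal_pyramid`, `classify_steps`), the stride-2 average pooling, the cosine-similarity scoring, and the two-dimensional toy features are all assumptions chosen for illustration. A real system would use learned video and text encoders rather than hand-written vectors.

```python
import math


def temporal_pyramid(features, num_levels=3):
    """Build coarser temporal scales by averaging adjacent time steps (stride 2).

    Hypothetical helper: the pooling scheme is an assumption, not the paper's.
    """
    levels = [features]
    for _ in range(num_levels - 1):
        prev = levels[-1]
        if len(prev) < 2:
            break
        pooled = [
            [(a + b) / 2.0 for a, b in zip(prev[i], prev[i + 1])]
            for i in range(0, len(prev) - 1, 2)
        ]
        levels.append(pooled)
    return levels


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def classify_steps(level, label_embeddings):
    """For each time step, pick the action label with the most similar embedding."""
    results = []
    for step in level:
        scores = {name: cosine(step, emb) for name, emb in label_embeddings.items()}
        best = max(scores, key=scores.get)
        results.append((best, scores[best]))
    return results


# Toy per-frame video features and label embeddings (made-up values).
video_features = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
label_embeddings = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}

pyramid = temporal_pyramid(video_features, num_levels=2)
predictions = [classify_steps(level, label_embeddings) for level in pyramid]
```

Because the label set is just a dictionary of embeddings, new action names can be added at inference time without retraining, which is the sense in which this kind of matching is open-vocabulary. Coarser pyramid levels cover longer time spans per step, so they are better suited to long actions, while finer levels suit short ones.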

Findings

01. Achieved superior results on the THUMOS14 and ActivityNet-1.3 datasets.

02. Outperformed existing methods in both open-vocabulary and closed-vocabulary settings.

03. Demonstrated effectiveness in detecting actions of varying durations.

Abstract

Open-vocabulary Temporal Action Detection (Open-vocab TAD) is an advanced video analysis approach that expands Closed-vocabulary Temporal Action Detection (Closed-vocab TAD) capabilities. Closed-vocab TAD is typically confined to localizing and classifying actions based on a predefined set of categories. In contrast, Open-vocab TAD goes further and is not limited to these predefined categories. This is particularly useful in real-world scenarios where the variety of actions in videos can be vast and not always predictable. The prevalent methods in Open-vocab TAD typically employ a 2-stage approach, which involves generating action proposals and then identifying those actions. However, errors made during the first stage can adversely affect the subsequent action identification accuracy. Additionally, existing studies face challenges in handling actions of different durations owing to the…

Code & Models

Repositories

thanhhff/HOTAD (Official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization

Methods: Sparse Evolutionary Training · ALIGN