Open-Vocabulary Spatio-Temporal Action Detection

Tao Wu; Shuqiu Ge; Jie Qin; Gangshan Wu; Limin Wang

arXiv:2405.10832·cs.CV·May 20, 2024

Open Access

TL;DR

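This paper introduces open-vocabulary spatio-temporal action detection (OV-STAD), a setting in which a model trained on a limited set of base classes must also detect novel action classes without re-training the whole network. It addresses the setting by fine-tuning pretrained video-language models and establishes new benchmarks for evaluation.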

Contribution

The paper proposes the open-vocabulary STAD (OV-STAD) setting, builds benchmarks for it, and develops an approach for fine-tuning pretrained video-language models to improve detection of novel action classes.

Findings

01. Achieves promising performance on novel classes

02. Effective fine-tuning improves motion understanding

03. Fusion of local and global features enhances detection accuracy

Abstract

Spatio-temporal action detection (STAD) is an important fine-grained video understanding task. Current methods require box and label supervision for all action classes in advance. However, in real-world applications, it is very likely to come across new action classes not seen in training because the action category space is large and hard to enumerate. Also, the cost of data annotation and model training for new classes is extremely high for traditional methods, as we need to perform detailed box annotations and re-train the whole network from scratch. In this paper, we propose a new challenging setting by performing open-vocabulary STAD to better mimic the situation of action detection in an open world. Open-vocabulary spatio-temporal action detection (OV-STAD) requires training a model on a limited set of base classes with box and label supervision, which is expected to yield good…
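The open-vocabulary idea in the abstract can be illustrated with a minimal sketch. It assumes, as is common for video-language models but is not confirmed by the text above, that each detected person box yields a visual feature and each class name yields a text embedding in a shared space, so a novel class can be added by embedding its name rather than re-training. The function names, the random stand-in embeddings, the class names, and the softmax temperature are all hypothetical and are not the authors' method.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_boxes(box_features, text_embeddings, temperature=0.01):
    """Score every box feature against every class-name embedding.

    Returns an array of shape (num_boxes, num_classes) whose rows are
    softmax probabilities over the class vocabulary.
    """
    v = l2_normalize(box_features)
    t = l2_normalize(text_embeddings)
    logits = (v @ t.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
dim = 16

# Hypothetical vocabulary: base classes have box and label supervision in
# training; the novel class is only named at test time.
base_classes = ["stand", "walk"]
novel_classes = ["jump"]
class_names = base_classes + novel_classes

# Stand-in for a text encoder: random embeddings, one per class name.
text_embeddings = rng.normal(size=(len(class_names), dim))

# Stand-in for per-box visual features: noisy copies of the "jump" and
# "walk" embeddings, mimicking features aligned with the text space.
box_features = np.stack([
    text_embeddings[2] + 0.1 * rng.normal(size=dim),
    text_embeddings[1] + 0.1 * rng.normal(size=dim),
])

probs = classify_boxes(box_features, text_embeddings)
predicted = [class_names[i] for i in probs.argmax(axis=1)]
print(predicted)
```

Under these assumptions, extending the vocabulary only requires appending a row to `text_embeddings`; the scoring function itself is unchanged. Whether the paper uses softmax or per-class scoring is not stated in the text above.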


Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Balanced Selection