Open-Vocabulary Spatio-Temporal Action Detection
Tao Wu, Shuqiu Ge, Jie Qin, Gangshan Wu, Limin Wang

TL;DR
This paper introduces open-vocabulary spatio-temporal action detection (OV-STAD), a setting in which a model trained on a limited set of base classes must also detect novel action classes without being re-trained. It builds benchmarks for this setting and fine-tunes pretrained video-language models to address it.
Contribution
It proposes a new open-vocabulary STAD setting, creates benchmarks, and develops a fine-tuning approach for pretrained video-language models to improve novel action detection.
Findings
Achieves promising performance on novel classes
Effective fine-tuning improves motion understanding
Fusion of local and global features enhances detection accuracy
Abstract
Spatio-temporal action detection (STAD) is an important fine-grained video understanding task. Current methods require box and label supervision for all action classes in advance. However, in real-world applications, it is very likely to come across new action classes not seen in training because the action category space is large and hard to enumerate. Also, the cost of data annotation and model training for new classes is extremely high for traditional methods, as we need to perform detailed box annotations and re-train the whole network from scratch. In this paper, we propose a new challenging setting by performing open-vocabulary STAD to better mimic the situation of action detection in an open world. Open-vocabulary spatio-temporal action detection (OV-STAD) requires training a model on a limited set of base classes with box and label supervision, which is expected to yield good…
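To make the open-vocabulary setting concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each detected person has a local (region) feature and a global (frame) feature, that the two are fused by a simple weighted average, and that classes are scored by cosine similarity against text embeddings of their names. The fusion rule, the `alpha` weight, the function names, and the toy three-dimensional embeddings are all hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(local_feat, global_feat, alpha=0.5):
    """Weighted average of region and frame features (assumed fusion rule)."""
    return [alpha * l + (1 - alpha) * g for l, g in zip(local_feat, global_feat)]


def score_classes(visual_feat, text_embeddings):
    """Cosine similarity between one visual feature and each class-name embedding."""
    v = l2_normalize(visual_feat)
    scores = {}
    for name, emb in text_embeddings.items():
        t = l2_normalize(emb)
        scores[name] = sum(a * b for a, b in zip(v, t))
    return scores


# Toy text embeddings for base classes (seen with box and label supervision).
text_embeddings = {"run": [1.0, 0.0, 0.0], "jump": [0.0, 1.0, 0.0]}
# A novel class is added at inference time by embedding its name only;
# no visual re-training is performed in this sketch.
text_embeddings["swim"] = [0.0, 0.0, 1.0]

local_feat = [0.1, 0.2, 0.9]   # hypothetical feature of the person region
global_feat = [0.0, 0.1, 0.8]  # hypothetical feature of the whole frame

scores = score_classes(fuse(local_feat, global_feat), text_embeddings)
best = max(scores, key=scores.get)
```

The point of the sketch is that the label set lives in the text-embedding dictionary rather than in a fixed classifier head, so extending the vocabulary means adding an entry rather than re-training a network.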
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training · Balanced Selection
