A Survey on Deep Learning-based Spatio-temporal Action Detection
Peng Wang, Fanwei Zeng, Yuntao Qian

TL;DR
This paper reviews recent deep learning methods for spatio-temporal action detection in videos, discussing their taxonomy, linking algorithms, datasets, evaluation metrics, and future research directions.
Contribution
It provides a comprehensive taxonomy and comparison of state-of-the-art deep learning approaches for STAD, highlighting current challenges and future research directions.
Findings
Performance benchmarks of leading models are summarized.
Linking algorithms, which associate frame- or clip-level detection results over time to form action tubes, are reviewed.
Potential research directions are discussed for advancing STAD.
Abstract
Spatio-temporal action detection (STAD) aims to classify the actions present in a video and localize them in space and time. It has become a particularly active area of research in computer vision because of its rapidly emerging real-world applications, such as autonomous driving, visual surveillance, and entertainment. Many efforts have been devoted in recent years to building a robust and effective framework for STAD. This paper provides a comprehensive review of the state-of-the-art deep learning-based methods for STAD. First, a taxonomy is developed to organize these methods. Next, the linking algorithms, which aim to associate the frame- or clip-level detection results to form action tubes, are reviewed. Then, the commonly used benchmark datasets and evaluation metrics are introduced, and the performance of state-of-the-art models is compared. Finally, this paper is…
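To make the idea of linking concrete, the following is a minimal sketch of one generic greedy strategy for joining frame-level detections into action tubes. It is not the algorithm of any particular method covered by the survey. The names `iou`, `link_greedy`, and `iou_thresh`, the box format `(x1, y1, x2, y2)`, and the "overlap plus score" matching rule are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # assumed format: (x1, y1, x2, y2)
Detection = Tuple[Box, float]             # (box, class score)
Tube = List[Tuple[int, Box, float]]       # (frame index, box, score) per frame


def iou(a: Box, b: Box) -> float:
    """Spatial intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_greedy(frames: List[List[Detection]], iou_thresh: float = 0.3) -> List[Tube]:
    """Greedily link per-frame detections of one action class into tubes.

    Illustrative rule (an assumption, not taken from the survey): a tube is
    extended by the unused detection that maximizes overlap + score, provided
    the overlap with the tube's last box is at least `iou_thresh`.
    """
    active: List[Tube] = []
    finished: List[Tube] = []
    for t, dets in enumerate(frames):
        unused = list(range(len(dets)))
        next_active: List[Tube] = []
        # Tubes with a higher mean score get first pick of the detections.
        ranked = sorted(active, key=lambda tb: -sum(s for _, _, s in tb) / len(tb))
        for tube in ranked:
            last_box = tube[-1][1]
            best, best_val = None, -1.0
            for j in unused:
                box, score = dets[j]
                overlap = iou(last_box, box)
                if overlap >= iou_thresh and overlap + score > best_val:
                    best, best_val = j, overlap + score
            if best is None:
                finished.append(tube)       # no match: the tube ends here
            else:
                box, score = dets[best]
                tube.append((t, box, score))
                unused.remove(best)
                next_active.append(tube)
        # Every unmatched detection starts a new tube.
        for j in unused:
            box, score = dets[j]
            next_active.append([(t, box, score)])
        active = next_active
    return finished + active
```

A tube in this sketch ends as soon as a frame offers no sufficiently overlapping detection. Handling missed detections and trimming tubes in time would require further design decisions that are left out here.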
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Anomaly Detection Techniques and Applications
