Multi-Stream Single Shot Spatial-Temporal Action Detection
Pengfei Zhang, Yu Cao, Benyuan Liu

TL;DR
This paper introduces a 3D CNN-based single shot detector for spatial-temporal action detection. It combines short-term streams, which read a single RGB frame and a single optical flow frame, with long-term streams that read sequences of past frames for context.
Contribution
To the authors' knowledge, it is the first system to combine 3D CNNs with SSD for action detection, using multiple streams to capture both the current frame and the context from past frames.
Findings
Achieves 71.30% frame-mAP on the UCF101-24 dataset
First to combine 3D CNN and SSD in action detection, to the best of the authors' knowledge
Reports the state-of-the-art result among one-stage methods on UCF101-24
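Frame-mAP scores each detection frame by frame, and a predicted box counts as correct when its class matches the ground truth and its intersection over union (IoU) with the ground-truth box is high enough. The sketch below illustrates that matching step only. The 0.5 threshold is an assumption (a common choice, not stated in this summary), and the function names `iou` and `is_true_positive` are hypothetical, not taken from the paper.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if the boxes overlap at all.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box: Box, pred_label: str,
                     gt_box: Box, gt_label: str,
                     threshold: float = 0.5) -> bool:
    """A detection is correct if the label matches and IoU reaches the threshold.

    The default threshold of 0.5 is an assumption made for illustration.
    """
    return pred_label == gt_label and iou(pred_box, gt_box) >= threshold
```

Average precision per class is then computed from the ranked list of such true and false positives, and frame-mAP is the mean over classes.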
Abstract
We present a 3D Convolutional Neural Network (CNN) based single shot detector for spatial-temporal action detection tasks. Our model includes: (1) two short-term streams, one for appearance and one for motion, which take a single RGB image and a single optical flow image as input, respectively, in order to capture the spatial and temporal information for the current frame; (2) two long-term 3D ConvNet based streams, working on sequences of continuous RGB and optical flow images to capture the context from past frames. Our model achieves strong performance for action detection in video and can be easily integrated into any current two-stream action detection method. We report a frame-mAP of 71.30% on the challenging UCF101-24 actions dataset, the state-of-the-art result among one-stage methods. To the best of our knowledge, our work is the first system that combines 3D CNN and SSD in action detection tasks.
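The four-stream layout in the abstract can be made concrete by listing which frames each stream would read when detecting at frame t. This is a minimal sketch, not the authors' implementation: the clip length of 8, the choice to include the current frame in the long-term window, and the names `build_stream_inputs`, `short_rgb`, `short_flow`, `long_rgb`, and `long_flow` are all assumptions made for illustration.

```python
from typing import Dict, List


def build_stream_inputs(t: int, clip_len: int = 8) -> Dict[str, List[int]]:
    """Return the frame indices each stream would read when detecting at frame t.

    clip_len is a hypothetical window size; the abstract does not state one.
    """
    if t < 0 or clip_len < 1:
        raise ValueError("t must be >= 0 and clip_len must be >= 1")

    # Long-term window: up to clip_len consecutive frames ending at t
    # (assumed to include the current frame), clipped at the video start.
    start = max(0, t - clip_len + 1)
    past_window = list(range(start, t + 1))

    return {
        "short_rgb": [t],          # single RGB image for the current frame
        "short_flow": [t],         # single optical flow image for the current frame
        "long_rgb": past_window,   # RGB sequence for the 3D ConvNet stream
        "long_flow": past_window,  # optical flow sequence for the 3D ConvNet stream
    }
```

For example, `build_stream_inputs(10, clip_len=4)` gives the short-term streams frame 10 only and the long-term streams frames 7 through 10.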
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
Methods: Convolution · Non Maximum Suppression · 1x1 Convolution · SSD
