Multi-Stream Single Shot Spatial-Temporal Action Detection

Pengfei Zhang; Yu Cao; Benyuan Liu

arXiv:1908.08178·cs.CV·August 23, 2019

TL;DR

This paper introduces a novel 3D CNN-based single shot detector for spatial-temporal action detection, combining short-term and long-term streams to improve accuracy in video analysis.

Contribution

To the authors' knowledge, it is the first system to combine 3D CNNs with SSD (Single Shot MultiBox Detector) for action detection, using two short-term and two long-term streams to capture both the current frame and the context from past frames.

Findings

01

Achieves 71.30% frame-mAP on the UCF101-24 dataset

02

First, to the authors' knowledge, to combine 3D CNN and SSD for action detection

03

Reports state-of-the-art frame-mAP among one-stage methods on UCF101-24

Abstract

We present a 3D Convolutional Neural Network (CNN) based single shot detector for spatial-temporal action detection. Our model includes: (1) two short-term streams, one for appearance and one for motion, which take a single RGB image and a single optical flow image respectively as input in order to capture the spatial and temporal information of the current frame; (2) two long-term 3D ConvNet based streams, which operate on sequences of consecutive RGB and optical flow images to capture the context from past frames. Our model achieves strong performance for action detection in video and can be easily integrated into any current two-stream action detection method. We report a frame-mAP of 71.30% on the challenging UCF101-24 action dataset, the state-of-the-art result among one-stage methods. To the best of our knowledge, our work is the first system to combine 3D CNN and SSD for action detection.
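The four-stream input layout described in the abstract can be illustrated with a minimal sketch. It only shows how the inputs for one detection step might be assembled, not the networks themselves. The function name `build_stream_inputs`, the dictionary keys, the `window` length of 8, and the choice to repeat the first frame at the start of a video are all assumptions made for illustration; the abstract does not specify any of them.

```python
from typing import Dict, List, Sequence, TypeVar

Frame = TypeVar("Frame")  # any per-frame object, e.g. an image array


def build_stream_inputs(
    rgb: Sequence[Frame],
    flow: Sequence[Frame],
    t: int,
    window: int = 8,  # hypothetical long-term clip length, not from the paper
) -> Dict[str, object]:
    """Assemble the inputs of the four streams for the current frame t.

    Short-term streams receive the single RGB / optical flow image at t.
    Long-term streams receive the `window` consecutive images ending at t,
    i.e. the current frame plus its past context.
    """
    if len(rgb) != len(flow):
        raise ValueError("rgb and flow must have the same number of frames")
    if window < 1:
        raise ValueError("window must be at least 1")
    if not 0 <= t < len(rgb):
        raise IndexError("t is outside the video")

    # Indices of the past frames ending at t. Near the start of the video
    # there are not enough past frames, so the first frame is repeated
    # (an assumption of this sketch).
    indices: List[int] = [max(0, i) for i in range(t - window + 1, t + 1)]

    return {
        "short_rgb": rgb[t],                       # appearance, current frame
        "short_flow": flow[t],                     # motion, current frame
        "long_rgb": [rgb[i] for i in indices],     # appearance, past context
        "long_flow": [flow[i] for i in indices],   # motion, past context
    }
```

In this sketch each long-term clip ends at the current frame, so all four streams describe the same time step and their detections could be fused per frame. How the paper actually fuses the streams is not stated in the abstract.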

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications

Methods: Convolution · Non Maximum Suppression · 1x1 Convolution · SSD