Real-Time Action Detection in Video Surveillance using Sub-Action Descriptor with Multi-CNN
Cheng-Bin Jin, Shengzhe Li, and Hakil Kim

TL;DR
This paper introduces a three-level sub-action descriptor and a multi-CNN detection model for real-time action detection of multiple people in video surveillance, improving both accuracy and speed on large-scale datasets.
Contribution
It proposes a novel sub-action descriptor with three levels (posture, locomotion, and gesture) and a multi-CNN-based detection model. Together they enable more detailed action recognition at real-time speed.
Findings
Achieved 76.6% frame-based mAP and 83.5% video-based mAP on the ICVL dataset
Outperformed state-of-the-art methods on the KTH dataset
Operates at 25 fps on the ICVL dataset and 80 fps on the KTH dataset
Abstract
When we say a person is texting, can you tell the person is walking or sitting? Emphatically, no. In order to solve this incomplete representation problem, this paper presents a sub-action descriptor for detailed action detection. The sub-action descriptor consists of three levels: the posture, the locomotion, and the gesture level. The three levels give three sub-action categories for one action to address the representation problem. The proposed action detection model simultaneously localizes and recognizes the actions of multiple individuals in video surveillance using appearance-based temporal features with multi-CNN. The proposed approach achieved a mean average precision (mAP) of 76.6% at the frame-based and 83.5% at the video-based measurement on the new large-scale ICVL video surveillance dataset that the authors introduce and make available to the community with this paper.…
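To make the representation problem concrete, here is a minimal sketch that models a sub-action descriptor as one category per level, so that one action carries three sub-action labels. Only "sitting", "walking", and "texting" come from the abstract. The other label values, the class names, and the hypothetical helper `descriptor_from_scores` are assumptions made for illustration. They are not the ICVL label set or the authors' multi-CNN. The helper assumes one score vector per level and takes the argmax of each level independently.

```python
from dataclasses import dataclass
from enum import Enum


# Hypothetical label sets: only "sitting", "walking", and "texting" appear in the
# abstract; the remaining values are placeholders, not the ICVL label inventory.
class Posture(Enum):
    SITTING = "sitting"
    STANDING = "standing"


class Locomotion(Enum):
    STATIONARY = "stationary"
    WALKING = "walking"


class Gesture(Enum):
    NO_GESTURE = "no_gesture"
    TEXTING = "texting"


@dataclass(frozen=True)
class SubActionDescriptor:
    """One action described by one category at each of the three levels."""
    posture: Posture
    locomotion: Locomotion
    gesture: Gesture

    def __str__(self) -> str:
        return f"{self.posture.value}/{self.locomotion.value}/{self.gesture.value}"


def descriptor_from_scores(scores: dict[str, dict[str, float]]) -> SubActionDescriptor:
    """Hypothetical helper: pick the highest-scoring category at each level.

    `scores` maps a level name ("posture", "locomotion", "gesture") to
    per-category scores, e.g. one classifier output per level. This is an
    assumption; the paper's multi-CNN may combine the levels differently.
    """
    def best(level: str, enum_cls):
        level_scores = scores[level]
        label = max(level_scores, key=level_scores.get)  # argmax over categories
        return enum_cls(label)  # look up the enum member by its value

    return SubActionDescriptor(
        posture=best("posture", Posture),
        locomotion=best("locomotion", Locomotion),
        gesture=best("gesture", Gesture),
    )


# A single "texting" label is ambiguous; the three-level descriptor separates the cases.
texting_while_walking = SubActionDescriptor(Posture.STANDING, Locomotion.WALKING, Gesture.TEXTING)
texting_while_sitting = SubActionDescriptor(Posture.SITTING, Locomotion.STATIONARY, Gesture.TEXTING)
print(texting_while_walking)  # standing/walking/texting
print(texting_while_sitting)  # sitting/stationary/texting
```

Both descriptors share the gesture "texting" but differ at the posture and locomotion levels. This is the answer to the question the abstract opens with.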
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Hand Gesture Recognition Systems
