Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection
Mohammadreza Zolfaghari, Gabriel L. Oliveira, Nima Sedaghat, and Thomas Brox

TL;DR
This paper introduces a multi-stream network architecture that effectively combines pose, motion, and appearance cues using a Markov chain model, achieving state-of-the-art results in action recognition and localization.
Contribution
The novel integration of pose, motion, and appearance cues via a Markov chain model enhances action recognition and localization performance.
Findings
Achieves state-of-the-art accuracy on HMDB51, J-HMDB, and NTU RGB+D datasets.
Yields state-of-the-art spatio-temporal action localization results on UCF101 and J-HMDB.
Efficient approach applicable to both classification and localization tasks.
Abstract
General human action recognition requires understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model which adds cues successively. The resulting approach is efficient and applicable to action classification as well as to spatial and temporal action localization. The two contributions clearly improve the performance over respective baselines. The overall approach achieves state-of-the-art action classification performance on HMDB51, J-HMDB and NTU RGB+D datasets. Moreover, it yields state-of-the-art spatio-temporal action localization results on UCF101 and J-HMDB.
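To make the idea of adding cues successively concrete, here is a minimal sketch of a chained forward pass. It is an illustration under stated assumptions, not the authors' implementation: the stage order (pose, then motion, then appearance), the choice to pass both the hidden features and the class probabilities of one stage to the next, the toy linear layers standing in for the real streams, and all names (`ChainStage`, `chained_forward`, the dimensions) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, inputs):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ChainStage:
    """One stage of the chain (hypothetical stand-in for a full stream).

    It sees its own cue plus whatever the previous stage produced.
    """

    def __init__(self, cue_dim, prev_dim, hidden_dim, num_classes, rng):
        self.w_hidden = random_matrix(hidden_dim, cue_dim + prev_dim, rng)
        self.w_out = random_matrix(num_classes, hidden_dim, rng)

    def forward(self, cue, prev):
        # ReLU over a linear map of [cue features, previous stage outputs].
        hidden = [max(0.0, v) for v in linear(self.w_hidden, cue + prev)]
        probs = softmax(linear(self.w_out, hidden))
        return hidden, probs


def chained_forward(stages, cues):
    """Run the stages in order; each one conditions only on its predecessor."""
    prev = []  # the first stage has no predecessor
    predictions = []
    for stage, cue in zip(stages, cues):
        hidden, probs = stage.forward(cue, prev)
        predictions.append(probs)
        prev = hidden + probs  # assumption: pass on features and prediction
    return predictions


# Usage with random toy features for the three cues.
rng = random.Random(0)
num_classes, hidden_dim = 5, 8
cue_dims = [6, 6, 6]  # assumed order: pose, motion, appearance

stages = []
prev_dim = 0
for cue_dim in cue_dims:
    stages.append(ChainStage(cue_dim, prev_dim, hidden_dim, num_classes, rng))
    prev_dim = hidden_dim + num_classes

cues = [[rng.random() for _ in range(d)] for d in cue_dims]
predictions = chained_forward(stages, cues)
for name, probs in zip(["pose", "motion", "appearance"], predictions):
    print(name, [round(p, 3) for p in probs])
```

Each stage refines the prediction of the one before it, so the chain yields one class distribution per added cue. Which stage's output serves as the final prediction is a design choice that this sketch leaves open.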
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods
