Multi-Modal Three-Stream Network for Action Recognition
Muhammad Usman Khalid, Jie Yu

TL;DR
This paper introduces a multi-modal three-stream network for human action recognition in videos. It integrates pose features with appearance and motion cues and outperforms prior state-of-the-art results on JHMDB, sub-JHMDB, and Penn Action.
Contribution
It proposes a novel three-stream framework that fuses appearance, motion, and pose cues, including noisy estimated poses, to improve action recognition accuracy.
Findings
Outperforms prior state-of-the-art results on the JHMDB, sub-JHMDB, and Penn Action datasets.
Effectively incorporates noisy pose estimates in the recognition framework.
Demonstrates the strength of combining complementary cues for complex video understanding.
Abstract
Human action recognition in video is an active yet challenging research topic due to the high variation and complexity of the data. In this paper, a novel video-based action recognition framework utilizing complementary cues is proposed to handle this complex problem. Inspired by the success of two-stream networks for action classification, additional pose features are studied and fused to enhance the understanding of human action in a more abstract and semantic way. For practical use, not only ground-truth poses but also noisy estimated poses are incorporated into the framework through our proposed pre-processing module. The whole framework and each cue are evaluated on several benchmark datasets: JHMDB, sub-JHMDB, and Penn Action. Our results outperform state-of-the-art performance on these datasets and show the strength of complementary cues.
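As a rough illustration of what fusing three streams can look like, the sketch below averages per-stream class probabilities (late fusion). This is an assumption made for illustration only: the abstract does not state how the appearance, motion, and pose streams are combined, and the function names, the weights parameter, and the example scores are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(appearance, motion, pose, weights=(1.0, 1.0, 1.0)):
    """Weighted average of per-stream class probabilities.

    Assumed late-fusion scheme, not the paper's confirmed method.
    Each argument is a list of raw class scores from one stream.
    """
    streams = [softmax(scores) for scores in (appearance, motion, pose)]
    num_classes = len(streams[0])
    if any(len(s) != num_classes for s in streams):
        raise ValueError("all streams must score the same number of classes")
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * s[c] for w, s in zip(weights, streams)) / weight_sum
        for c in range(num_classes)
    ]


def predict(fused):
    """Return the index of the highest fused probability."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for three action classes from each stream.
appearance_scores = [2.0, 1.0, 0.1]
motion_scores = [0.5, 2.5, 0.3]
pose_scores = [0.2, 3.0, 0.1]

fused = fuse_streams(appearance_scores, motion_scores, pose_scores)
label = predict(fused)
```

In this toy case the appearance stream alone favors class 0, while the motion and pose streams favor class 1, so the fused prediction follows the two agreeing streams. A lower weight on the pose stream would be one simple way to discount noisy estimated poses, though the paper handles that with its own pre-processing module.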
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
