Unified Framework with Consistency across Modalities for Human Activity Recognition
Tuyen Tran, Thao Minh Le, Hung Tran, Truyen Tran

TL;DR
This paper introduces a multimodal framework that combines a compositional query machine with a consistency loss to improve human activity recognition in videos by leveraging multiple input modalities.
Contribution
The paper presents a new neural architecture called COMPUTER that models interactions across modalities and enforces prediction consistency, advancing multimodal human activity recognition.
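To make the idea of enforcing prediction consistency across modalities concrete, here is a minimal sketch. It is not the loss defined in the paper, which this summary does not spell out. It assumes two hypothetical branches, one for RGB and one for skeletal (pose) input, that each emit class logits, and it penalizes their disagreement with a symmetric KL divergence. The function names and the choice of divergence are illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(rgb_logits, pose_logits):
    """Symmetric KL between the class distributions of two modality branches.

    Hypothetical stand-in for a cross-modal consistency term; the paper's
    actual formulation may differ.
    """
    p = softmax(rgb_logits)
    q = softmax(pose_logits)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


if __name__ == "__main__":
    # Two branches that agree on the class ranking but not on confidence.
    print(consistency_loss([2.0, 0.5, -1.0], [1.0, 0.8, -0.5]))
```

In a training loop, a term like this would typically be added to the per-branch classification losses with a weighting factor, so that the branches are pushed toward the same prediction without being forced to share features.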
Findings
Achieves superior performance on action localization tasks
Effectively leverages complementary information across modalities
Demonstrates robustness in group activity recognition
Abstract
Recognizing human activities in videos is challenging due to the spatio-temporal complexity and context-dependence of human interactions. Prior studies often rely on single input modalities, such as RGB or skeletal data, limiting their ability to exploit the complementary advantages across modalities. Recent studies focus on combining these two modalities using simple feature fusion techniques. However, due to the inherent disparities in representation between these input modalities, designing a unified neural network architecture to effectively leverage their complementary information remains a significant challenge. To address this, we propose a comprehensive multimodal framework for robust video-based human activity recognition. Our key contribution is the introduction of a novel compositional query machine, called COMPUTER ($\textbf{COMP}ositional h\textbf{U}man-cen\textbf{T}ric…
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Human Pose and Action Recognition
Methods: Focus
