Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition
Zeyu Liang, Hailun Xia, Naichuan Zheng

TL;DR
This paper introduces PAN, a human-centric graph learning framework that effectively fuses RGB and skeleton data for multimodal action recognition, achieving state-of-the-art results.
Contribution
The paper proposes a novel human-centric graph modeling paradigm and two variants, PAN-Ensemble and PAN-Unified, for improved multimodal action recognition.
Findings
Achieves state-of-the-art performance on three datasets.
Effectively fuses RGB and skeleton modalities.
Reduces dependency on high-quality skeletal data.
Abstract
While human action recognition has witnessed notable achievements, multimodal methods fusing RGB and skeleton modalities still suffer from their inherent heterogeneity and fail to fully exploit the complementary potential between them. In this paper, we propose PAN, the first human-centric graph representation learning framework for multimodal action recognition, in which token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. The human-centric graph modeling paradigm suppresses the redundancy in RGB frames and aligns well with skeleton-based methods, thus enabling a more effective and semantically coherent fusion of multimodal features. Since the sampling of token embeddings heavily relies on 2D skeletal data, we further propose attention-based post calibration to reduce the dependency on high-quality skeletal data at a minimal cost in terms of…
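The "patch as node" idea in the abstract, sampling the token embeddings of RGB patches that contain human joints, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the fixed square patch grid, and the clipping of out-of-frame joints are all assumptions made for illustration, and the attention-based post calibration step is not modeled.

```python
import numpy as np


def joints_to_patch_indices(joints_xy, image_size, patch_size):
    """Map 2D joint coordinates to flat indices of the patches containing them.

    joints_xy:  (T, J, 2) array of (x, y) pixel coordinates per frame and joint.
    image_size: (height, width) of the RGB frames in pixels.
    patch_size: side length of a square patch in pixels (assumed ViT-style grid).
    Returns an integer array of shape (T, J).
    """
    height, width = image_size
    grid_h = height // patch_size
    grid_w = width // patch_size
    # Assumption: joints outside the frame are clipped to the nearest border patch.
    cols = np.clip(joints_xy[..., 0] // patch_size, 0, grid_w - 1).astype(int)
    rows = np.clip(joints_xy[..., 1] // patch_size, 0, grid_h - 1).astype(int)
    return rows * grid_w + cols


def gather_joint_tokens(tokens, patch_idx):
    """Select one token embedding per joint per frame.

    tokens:    (T, N, D) patch token embeddings, N = grid_h * grid_w.
    patch_idx: (T, J) flat patch indices from joints_to_patch_indices.
    Returns node features of shape (T, J, D), one graph node per joint.
    """
    frame_idx = np.arange(tokens.shape[0])[:, None]  # (T, 1), broadcasts over J
    return tokens[frame_idx, patch_idx]


# Hypothetical example: 2 frames, 3 joints, 224x224 frames, 16x16 patches.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 196, 8))
joints = np.array([
    [[0.0, 0.0], [17.0, 33.0], [223.0, 223.0]],
    [[100.0, 50.0], [-5.0, 10.0], [300.0, 120.0]],
])
idx = joints_to_patch_indices(joints, image_size=(224, 224), patch_size=16)
nodes = gather_joint_tokens(tokens, idx)
print(nodes.shape)  # (2, 3, 8)
```

The resulting `(T, J, D)` array has the same layout as skeleton joint features, which is one way to read the abstract's claim that the representation aligns well with skeleton-based methods; edges between nodes would come from whatever skeleton topology the downstream graph model assumes.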
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Graph Neural Networks · Advanced Technologies in Various Fields
