Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition

Zeyu Liang; Hailun Xia; Naichuan Zheng

arXiv:2512.21916·cs.CV·December 29, 2025

TL;DR

This paper introduces PAN, a human-centric graph learning framework that fuses RGB and skeleton data for multimodal action recognition and reports state-of-the-art results on three datasets.

Contribution

The paper proposes a novel human-centric graph modeling paradigm and two variants, PAN-Ensemble and PAN-Unified, for improved multimodal action recognition.

Findings

01. Achieves state-of-the-art performance on three datasets.

02. Effectively fuses RGB and skeleton modalities.

03. Reduces dependency on high-quality skeletal data.

Abstract

While human action recognition has witnessed notable achievements, multimodal methods fusing RGB and skeleton modalities still suffer from their inherent heterogeneity and fail to fully exploit the complementary potential between them. In this paper, we propose PAN, the first human-centric graph representation learning framework for multimodal action recognition, in which token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. The human-centric graph modeling paradigm suppresses the redundancy in RGB frames and aligns well with skeleton-based methods, thus enabling a more effective and semantically coherent fusion of multimodal features. Since the sampling of token embeddings heavily relies on 2D skeletal data, we further propose attention-based post calibration to reduce the dependency on high-quality skeletal data at a minimal cost in terms of…
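The "patch as node" idea in the abstract, sampling the token embeddings of RGB patches that contain human joints, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tensor layout `(T, N, C)`, the row-major patch ordering, and the square non-overlapping patch grid are all assumptions made for illustration. The sketch covers only node sampling; graph edges and the attention-based post calibration are not shown.

```python
import numpy as np


def joints_to_patch_indices(joints_xy, image_size, patch_size):
    """Map 2D joint pixel coordinates to flat indices of the patches containing them.

    Hypothetical helper. Assumes non-overlapping square patches laid out
    row-major, as in a standard ViT-style patch embedding.
    joints_xy: (..., 2) array of (x, y) pixel coordinates.
    """
    height, width = image_size
    grid_h, grid_w = height // patch_size, width // patch_size
    # Clamp so joints detected slightly outside the frame still map to a valid patch.
    cols = np.clip(joints_xy[..., 0] // patch_size, 0, grid_w - 1).astype(int)
    rows = np.clip(joints_xy[..., 1] // patch_size, 0, grid_h - 1).astype(int)
    return rows * grid_w + cols


def sample_joint_tokens(tokens, joints_xy, image_size, patch_size):
    """Gather one patch token per joint per frame.

    tokens:    (T, N, C) patch token embeddings for T frames, N patches per frame.
    joints_xy: (T, J, 2) 2D joint coordinates in pixels.
    Returns:   (T, J, C) node features, one node per joint per frame.
    """
    idx = joints_to_patch_indices(joints_xy, image_size, patch_size)  # (T, J)
    frame_idx = np.arange(tokens.shape[0])[:, None]                   # (T, 1)
    return tokens[frame_idx, idx]


# Toy example: 2 frames of a 64x64 image, 16x16 patches (4x4 grid), 3-dim tokens.
tokens = np.arange(2 * 16 * 3).reshape(2, 16, 3)
joints = np.array([
    [[5.0, 5.0], [40.0, 20.0]],     # frame 0
    [[63.0, 63.0], [100.0, -4.0]],  # frame 1 (second joint lies outside the frame)
])
nodes = sample_joint_tokens(tokens, joints, image_size=(64, 64), patch_size=16)
```

Under these assumptions, `nodes` has shape `(T, J, C)`, the same frames-by-joints layout that skeleton-based models consume, which gives one way to read the abstract's claim that the paradigm "aligns well with skeleton-based methods".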

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Graph Neural Networks · Advanced Technologies in Various Fields