Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition

Hongsong Wang; Heng Fei; Bingxuan Dai; Jie Gui

arXiv:2512.21064·cs.CV·March 11, 2026

TL;DR

This paper introduces a self-supervised framework for multimodal skeleton-based action recognition that balances efficiency and performance by decomposing and composing features, outperforming simple fusion methods.

Contribution

It proposes a novel Decomposition and Composition framework that effectively utilizes multimodal features for action recognition with improved efficiency and accuracy.

Findings

01 Achieves high accuracy on NTU RGB+D datasets

02 Reduces computational cost compared to late fusion methods

03 Enhances multimodal feature learning through self-supervision

Abstract

Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most existing methods rely on simple late fusion to enhance performance, which results in substantial computational overhead. Although early fusion with a shared backbone for all modalities is efficient, it struggles to achieve excellent performance. To address the dilemma of balancing efficiency and effectiveness, we introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The Decomposition strategy meticulously decomposes the fused multimodal features into distinct unimodal features, subsequently aligning them with their respective ground truth unimodal counterparts. On the other…
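The Decomposition step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the modality names (`joint`, `bone`, `motion`), the linear projection heads, the cosine-based alignment loss, and the random stand-in vectors used in place of real fused and unimodal features.

```python
import math
import random

# Assumed modality names; the abstract does not list the modalities.
MODALITIES = ("joint", "bone", "motion")


def linear(weights, x):
    """Apply a bias-free dense layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine_similarity(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def decompose(fused, heads):
    """Project one fused feature into one estimated feature per modality."""
    return {m: linear(heads[m], fused) for m in heads}


def alignment_loss(decomposed, targets):
    """Mean of (1 - cosine similarity) over modalities (hypothetical loss choice)."""
    losses = [1.0 - cosine_similarity(decomposed[m], targets[m]) for m in decomposed]
    return sum(losses) / len(losses)


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 8

# Stand-ins: a fused multimodal feature and one target feature per modality.
fused = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
targets = {m: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for m in MODALITIES}

# One hypothetical projection head per modality.
heads = {m: random_matrix(rng, dim, dim) for m in MODALITIES}

decomposed = decompose(fused, heads)
loss = alignment_loss(decomposed, targets)
```

In a training loop, the heads and the shared backbone producing `fused` would be updated to reduce `loss`. The sketch only shows the shape of the computation: one fused feature in, one estimated feature per modality out, each compared against its unimodal counterpart.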

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)