Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition
Hongsong Wang, Heng Fei, Bingxuan Dai, Jie Gui

TL;DR
This paper introduces a self-supervised framework for multimodal skeleton-based action recognition that balances efficiency and performance by decomposing and composing features, outperforming simple fusion methods.
Contribution
It proposes a novel Decomposition and Composition framework that effectively utilizes multimodal features for action recognition with improved efficiency and accuracy.
Findings
Achieves high accuracy on NTU RGB+D datasets
Reduces computational cost compared to late fusion methods
Enhances multimodal feature learning through self-supervision
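The efficiency gap between late and early fusion noted above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two-layer `make_backbone` stand-in, the three input streams, the feature sizes, and the use of an elementwise sum as the early-fusion operator. It only shows that late fusion needs one backbone pass per modality, while early fusion with a shared backbone needs a single pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_backbone(in_dim, hid_dim, n_classes):
    """Toy stand-in for a skeleton encoder plus classifier (hypothetical)."""
    w1 = rng.standard_normal((in_dim, hid_dim)) * 0.1
    w2 = rng.standard_normal((hid_dim, n_classes)) * 0.1

    def forward(x):
        # Linear layer, ReLU, linear layer: returns class scores.
        return np.maximum(x @ w1, 0.0) @ w2

    return forward

def late_fusion(inputs, backbones):
    """Run one backbone per modality, then average the class scores."""
    scores = [f(x) for f, x in zip(backbones, inputs)]
    return np.mean(scores, axis=0), len(backbones)  # (scores, backbone passes)

def early_fusion(inputs, shared_backbone):
    """Fuse the inputs first (assumed: elementwise sum), then run one pass."""
    fused = np.sum(inputs, axis=0)
    return shared_backbone(fused), 1  # (scores, backbone passes)

# Three hypothetical modality streams for a batch of 4 samples, 16 features each.
inputs = [rng.standard_normal((4, 16)) for _ in range(3)]
backbones = [make_backbone(16, 32, 10) for _ in range(3)]

late_scores, late_passes = late_fusion(inputs, backbones)
early_scores, early_passes = early_fusion(inputs, backbones[0])
print(late_scores.shape, late_passes, early_scores.shape, early_passes)
```

Under these assumptions the inference cost of late fusion grows linearly with the number of modalities, which is the overhead the paper aims to avoid.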
Abstract
Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most existing methods rely on simple late fusion to enhance performance, which results in substantial computational overhead. Although early fusion with a shared backbone for all modalities is efficient, it struggles to achieve excellent performance. To address the dilemma of balancing efficiency and effectiveness, we introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The Decomposition strategy meticulously decomposes the fused multimodal features into distinct unimodal features, subsequently aligning them with their respective ground truth unimodal counterparts. On the other…
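The Decomposition idea described in the abstract, splitting a fused feature into per-modality features and aligning each with its unimodal counterpart, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear projection heads, the negative cosine similarity objective, the modality names `joint` and `bone`, and the feature size are all hypothetical choices, since the abstract does not specify them.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def decomposition_loss(fused, unimodal_targets, heads):
    """Mean negative cosine similarity between decomposed and target features.

    fused:            (batch, dim) fused multimodal feature
    unimodal_targets: dict of modality name -> (batch, dim) unimodal feature
    heads:            dict of modality name -> (dim, dim) projection (assumed linear)
    """
    losses = []
    for name, target in unimodal_targets.items():
        pred = fused @ heads[name]  # decomposed feature for this modality
        cos = np.sum(l2_normalize(pred) * l2_normalize(target), axis=-1)
        losses.append(-cos.mean())
    return float(np.mean(losses))

rng = np.random.default_rng(0)
dim = 8
fused = rng.standard_normal((4, dim))

# Hypothetical modality names; the targets would come from unimodal encoders.
targets = {
    "joint": rng.standard_normal((4, dim)),
    "bone": rng.standard_normal((4, dim)),
}
heads = {name: rng.standard_normal((dim, dim)) * 0.1 for name in targets}

print(decomposition_loss(fused, targets, heads))
```

The loss lies in [-1, 1] and approaches -1 as each decomposed feature lines up with its unimodal target. In a self-supervised setup the targets would typically be treated as fixed (no gradient) while the fused encoder and heads are trained, though that detail is also an assumption here.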
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
