RCMCL: A Unified Contrastive Learning Framework for Robust Multi-Modal (RGB-D, Skeleton, Point Cloud) Action Understanding
Hasan Akgul, Mari Eplik, Javier Rojas, Akira Yamamoto, Rajesh Kumar, and Maya Singh

TL;DR
This paper introduces RCMCL, a self-supervised framework for multi-modal human action recognition that learns robust, modality-invariant representations and maintains high accuracy even with sensor noise or dropout.
Contribution
RCMCL is the first unified contrastive learning framework that enhances robustness and view-invariance in multi-modal HAR through cross-modal alignment, self-distillation, and degradation simulation.
Findings
Achieves state-of-the-art accuracy on NTU RGB+D 120 and UWA3D-II datasets.
Demonstrates only 11.5% performance degradation under severe modality dropout.
Outperforms supervised baselines in robustness tests.
Abstract
Human action recognition (HAR) with multi-modal inputs (RGB-D, skeleton, point cloud) can achieve high accuracy but typically relies on large labeled datasets and degrades sharply when sensors fail or are noisy. We present Robust Cross-Modal Contrastive Learning (RCMCL), a self-supervised framework that learns modality-invariant representations and remains reliable under modality dropout and corruption. RCMCL jointly optimizes (i) a cross-modal contrastive objective that aligns heterogeneous streams, (ii) an intra-modal self-distillation objective that improves view-invariance and reduces redundancy, and (iii) a degradation simulation objective that explicitly trains models to recover from masked or corrupted inputs. At inference, an Adaptive Modality Gating (AMG) network assigns data-driven reliability weights to each modality for robust fusion. On NTU RGB+D 120 (CS/CV) and UWA3D-II,…
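As a rough illustration of two ideas named in the abstract, the sketch below pairs a cross-modal contrastive objective with reliability-weighted fusion. It is not the authors' implementation. The excerpt gives neither the exact loss nor the AMG architecture, so the code assumes a generic symmetric InfoNCE loss and a plain softmax over per-modality reliability scores. Every name here (info_nce, cross_modal_loss, gated_fusion, temperature) is hypothetical, and the reliability scores are passed in by hand rather than predicted by a learned gating network.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i]
    against every other positives[j] in the batch."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [dot(a_i, p_j) / temperature for p_j in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(a)


def cross_modal_loss(emb_a, emb_b, temperature=0.1):
    """Symmetric loss: align modality A to B and B to A (assumed form)."""
    return 0.5 * (info_nce(emb_a, emb_b, temperature)
                  + info_nce(emb_b, emb_a, temperature))


def gated_fusion(embeddings, reliability_scores):
    """Softmax-weighted sum of per-modality embeddings.
    A score of None marks a dropped modality, which is skipped."""
    present = [(e, s) for e, s in zip(embeddings, reliability_scores)
               if s is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    m = max(s for _, s in present)
    exps = [math.exp(s - m) for _, s in present]
    z = sum(exps)
    weights = [x / z for x in exps]
    dim = len(present[0][0])
    fused = [sum(w * e[k] for w, (e, _) in zip(weights, present))
             for k in range(dim)]
    return fused, weights


# Toy batch of two samples with 2-D embeddings from two modalities.
skeleton = [[1.0, 0.0], [0.0, 1.0]]
rgb_aligned = [[0.9, 0.1], [0.1, 0.9]]
rgb_shuffled = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = cross_modal_loss(skeleton, rgb_aligned)
shuffled_loss = cross_modal_loss(skeleton, rgb_shuffled)

# The third modality is treated as dropped (score None).
fused, weights = gated_fusion(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [0.0, 0.0, None],
)
```

In this toy setup the loss is lower when the two modalities agree sample by sample than when the pairing is shuffled, and the fusion step renormalizes its weights over the modalities that remain after one is dropped.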
Taxonomy
Topics: Human Pose and Action Recognition · Context-Aware Activity Recognition Systems · Robot Manipulation and Learning
