RCMCL: A Unified Contrastive Learning Framework for Robust Multi-Modal (RGB-D, Skeleton, Point Cloud) Action Understanding

Hasan Akgul, Mari Eplik, Javier Rojas, Akira Yamamoto, Rajesh Kumar, and Maya Singh

arXiv:2511.04351 · eess.SP · November 18, 2025

Open Access

TL;DR

This paper introduces RCMCL, a self-supervised framework for multi-modal human action recognition that learns robust, modality-invariant representations and maintains high accuracy even with sensor noise or dropout.

Contribution

RCMCL is the first unified contrastive learning framework that enhances robustness and view-invariance in multi-modal HAR through cross-modal alignment, self-distillation, and degradation simulation.

Findings

1. Achieves state-of-the-art accuracy on NTU RGB+D 120 and UWA3D-II datasets.
2. Demonstrates only 11.5% performance degradation under severe modality dropout.
3. Outperforms supervised baselines in robustness tests.

Abstract

Human action recognition (HAR) with multi-modal inputs (RGB-D, skeleton, point cloud) can achieve high accuracy but typically relies on large labeled datasets and degrades sharply when sensors fail or are noisy. We present Robust Cross-Modal Contrastive Learning (RCMCL), a self-supervised framework that learns modality-invariant representations and remains reliable under modality dropout and corruption. RCMCL jointly optimizes (i) a cross-modal contrastive objective that aligns heterogeneous streams, (ii) an intra-modal self-distillation objective that improves view-invariance and reduces redundancy, and (iii) a degradation simulation objective that explicitly trains models to recover from masked or corrupted inputs. At inference, an Adaptive Modality Gating (AMG) network assigns data-driven reliability weights to each modality for robust fusion. On NTU RGB+D 120 (CS/CV) and UWA3D-II,…
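The abstract names a cross-modal contrastive objective and a reliability-weighted fusion step but gives no formulas here. The following is a minimal sketch of how such components are commonly built, not the authors' implementation. It assumes a symmetric InfoNCE loss for the cross-modal alignment and a softmax over per-modality scores for the gating. The function names, the temperature value, and the idea of passing reliability scores in directly (rather than predicting them with the AMG network) are all illustrative assumptions.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length; an all-zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(logits, idx):
    # Numerically stable log-softmax, evaluated at a single index.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[idx] - lse


def cross_modal_info_nce(emb_a, emb_b, temperature=0.1):
    """Symmetric InfoNCE between two modalities (assumed loss form).

    Row i of emb_a and row i of emb_b are embeddings of the same clip
    (the positive pair); every other row in the batch is a negative.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    n = len(a)
    loss = 0.0
    for i in range(n):
        a_to_b = [dot(a[i], b[j]) / temperature for j in range(n)]
        b_to_a = [dot(b[i], a[j]) / temperature for j in range(n)]
        loss -= log_softmax_at(a_to_b, i) + log_softmax_at(b_to_a, i)
    return loss / (2 * n)


def gated_fusion(features, reliability_scores):
    """Fuse modality features with softmax reliability weights.

    features maps a modality name to a feature vector, or to None when
    that modality is dropped. reliability_scores is a hypothetical
    stand-in for the output of a learned gating network.
    """
    names = [k for k, f in features.items() if f is not None]
    if not names:
        raise ValueError("at least one modality must be available")
    m = max(reliability_scores[k] for k in names)
    exps = {k: math.exp(reliability_scores[k] - m) for k in names}
    total = sum(exps.values())
    weights = {k: exps[k] / total for k in names}
    dim = len(features[names[0]])
    fused = [sum(weights[k] * features[k][d] for k in names) for d in range(dim)]
    return fused, weights
```

In this sketch, a dropped modality is simply excluded before the softmax, so the remaining weights still sum to one. The loss is low when matching clips from two modalities have similar embeddings and high when they are paired with the wrong clip.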

Taxonomy

Topics: Human Pose and Action Recognition · Context-Aware Activity Recognition Systems · Robot Manipulation and Learning