MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing

Debashis Gupta; Aditi Golder; Rongkhun Zhu; Kangning Cui; Wei Tang; Fan Yang; Ovidiu Csillik; Sarra Alaqahtani; V. Paul Pauca

arXiv:2507.08683 · cs.CV · July 14, 2025

TL;DR

MoSAiC introduces a multi-modal, multi-label contrastive learning framework tailored for remote sensing, improving semantic disentanglement and robustness in satellite imagery analysis, especially under low-label and complex class conditions.

Contribution

MoSAiC is a unified contrastive learning approach for satellite imagery that combines intra- and inter-modality supervision with multi-label alignment.

Findings

1. Outperforms supervised and self-supervised baselines in accuracy
2. Enhances cluster coherence and semantic disentanglement
3. Improves generalization in low-label, high-overlap scenarios

Abstract

Contrastive learning (CL) has emerged as a powerful paradigm for learning transferable representations without relying on large labeled datasets. Its ability to capture intrinsic similarities and differences among data samples has led to state-of-the-art results in computer vision tasks. These strengths make CL particularly well-suited for Earth System Observation (ESO), where diverse satellite modalities such as optical and SAR imagery offer naturally aligned views of the same geospatial regions. However, ESO presents unique challenges, including high inter-class similarity, scene clutter, and ambiguous boundaries, which complicate representation learning -- especially in low-label, multi-label settings. Existing CL frameworks often focus on intra-modality self-supervision or lack mechanisms for multi-label alignment and semantic precision across modalities. In this work, we…
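
The abstract is cut off before the method is described, so the following is only an illustrative sketch of the general idea it names (label-aware contrastive learning within and across optical and SAR views), not the paper's actual loss. It assumes positives are weighted by the Jaccard overlap of their label sets, which is one common choice for multi-label settings; the function names, the `temperature` value, and the `intra_weight` mixing factor are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def jaccard(a, b):
    """Label-set overlap in [0, 1]; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def weighted_contrastive(anchors, candidates, labels,
                         temperature=0.1, exclude_self=False):
    """Softmax contrastive loss with Jaccard-weighted positives.

    Assumption (not from the paper): candidate j is a positive for
    anchor i in proportion to jaccard(labels[i], labels[j]).
    """
    anchors = [l2_normalize(v) for v in anchors]
    candidates = [l2_normalize(v) for v in candidates]
    total, counted = 0.0, 0
    for i, a in enumerate(anchors):
        idx = [j for j in range(len(candidates))
               if not (exclude_self and j == i)]
        if not idx:
            continue
        # Cosine similarities scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a, candidates[j])) / temperature
                  for j in idx]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        weights = [jaccard(labels[i], labels[j]) for j in idx]
        w_sum = sum(weights)
        if w_sum == 0:
            continue  # anchor has no label-sharing candidate
        total += -sum(w * (l - log_z)
                      for w, l in zip(weights, logits)) / w_sum
        counted += 1
    return total / counted if counted else 0.0


def combined_loss(optical, sar, labels, temperature=0.1, intra_weight=0.5):
    """Hypothetical mix of inter-modality and intra-modality terms."""
    inter = 0.5 * (weighted_contrastive(optical, sar, labels, temperature)
                   + weighted_contrastive(sar, optical, labels, temperature))
    intra = 0.5 * (
        weighted_contrastive(optical, optical, labels, temperature,
                             exclude_self=True)
        + weighted_contrastive(sar, sar, labels, temperature,
                               exclude_self=True))
    return inter + intra_weight * intra


if __name__ == "__main__":
    # Three scenes: scenes 0 and 1 share a label, scene 2 is disjoint.
    labels = [{0, 1}, {1}, {2}]
    optical = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
    sar = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
    print(combined_loss(optical, sar, labels))
```

In this sketch the same scene seen in the other modality always has weight 1, while scenes with partially overlapping labels act as softer positives. A real training setup would compute the same quantity on batched tensors in an autodiff framework.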
