MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP

Aditya Chaudhary; Sneha Barman; Mainak Singha; Ankit Jha; Girish Mishra; Biplab Banerjee

arXiv:2601.08420·cs.CV·January 14, 2026

TL;DR

This paper introduces MMLGNet, a framework that uses CLIP to align remote sensing modalities such as hyperspectral imagery and LiDAR with natural language, enabling semantic-level understanding and outperforming existing multimodal methods on benchmark datasets.

Contribution

MMLGNet is the first to effectively fuse remote sensing modalities with language semantics using CLIP-based contrastive learning.

Findings

01. MMLGNet outperforms existing multimodal methods on benchmark datasets.

02. Simple CNN encoders suffice for effective modality alignment.

03. Language supervision significantly improves remote sensing data interpretation.

Abstract

In this paper, we propose a novel multimodal framework, Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities like Hyperspectral Imaging (HSI) and LiDAR with natural language semantics using vision-language models such as CLIP. With the increasing availability of multimodal Earth observation data, there is a growing need for methods that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding. MMLGNet employs modality-specific encoders and aligns visual features with handcrafted textual embeddings in a shared latent space via bi-directional contrastive learning. Inspired by CLIP's training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation. Notably, MMLGNet achieves strong performance with simple CNN-based encoders, outperforming…
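The bi-directional contrastive alignment mentioned in the abstract can be illustrated with a small, dependency-free sketch of a CLIP-style symmetric loss. This is not the authors' implementation: the function names, the temperature of 0.07, and the assumption that row i of the visual batch matches row i of the text batch are all illustrative choices, and the visual vectors are assumed to be already-fused HSI and LiDAR features produced upstream.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(visual, text, temperature=0.07):
    """CLIP-style loss averaged over visual-to-text and text-to-visual directions.

    `visual` and `text` are equal-length lists of embedding vectors; pair i is
    assumed to be the only positive for row i (an assumption of this sketch).
    The default temperature is an illustrative value, not taken from the paper.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    # Similarity matrix: rows index visual samples, columns index text samples.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_transposed = [list(col) for col in zip(*logits)]
    visual_to_text = diagonal_cross_entropy(logits)
    text_to_visual = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (visual_to_text + text_to_visual)

# Toy batch: two samples with orthogonal embeddings.
visual_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(visual_batch, aligned_text)
swapped_loss = symmetric_contrastive_loss(visual_batch, swapped_text)
```

In this toy batch the correctly paired embeddings yield a loss near zero, while the swapped pairing is penalized heavily, which is the behavior that pulls matching visual and text embeddings together in the shared latent space.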

Taxonomy

Topics: Multimodal Machine Learning Applications · Remote-Sensing Image Classification · Domain Adaptation and Few-Shot Learning