Bi-Modal Textual Prompt Learning for Vision-Language Models in Remote Sensing

Pankhi Kashyap; Mainak Singha; Biplab Banerjee

arXiv:2601.20675·cs.CV·January 29, 2026

TL;DR

This paper introduces BiMoRS, a bi-modal prompt learning framework that enhances vision-language models for remote sensing tasks by leveraging textual summaries from image captioning to improve generalization and performance.

Contribution

BiMoRS is a novel lightweight bi-modal prompt learning method that fuses textual and visual features for remote sensing, improving transferability and accuracy without modifying the core model.

Findings

01 Outperforms strong baselines by up to 2% on average across datasets.

02 Handles remote sensing (RS) data with multi-label scenes and high intra-class variability effectively.

03 Demonstrates improved domain generalization on remote sensing tasks.

Abstract

Prompt learning (PL) has emerged as an effective strategy to adapt vision-language models (VLMs), such as CLIP, for downstream tasks under limited supervision. While PL has demonstrated strong generalization on natural image datasets, its transferability to remote sensing (RS) imagery remains underexplored. RS data present unique challenges, including multi-label scenes, high intra-class variability, and diverse spatial resolutions, that hinder the direct applicability of existing PL methods. In particular, current prompt-based approaches often struggle to identify dominant semantic cues and fail to generalize to novel classes in RS scenarios. To address these challenges, we propose BiMoRS, a lightweight bi-modal prompt learning framework tailored for RS tasks. BiMoRS employs a frozen image captioning model (e.g., BLIP-2) to extract textual semantic summaries from RS images. These…
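
As a rough illustration of the idea in the abstract, the sketch below conditions a small set of learnable prompt tokens on a caption embedding and an image embedding while every encoder stays frozen. The abstract is truncated before it explains how the caption summaries are used, so the fusion step here (a single cross-attention read over the two features) is an assumption rather than the authors' method. All function names, dimensions, the logit scale, and the stand-in encoders are hypothetical; a real pipeline would replace them with a frozen captioner such as BLIP-2 and a frozen CLIP backbone.

```python
import numpy as np

D = 16         # shared embedding width (hypothetical)
N_PROMPT = 4   # number of learnable prompt tokens (hypothetical)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def embed_text(text):
    # Stand-in for a frozen text encoder: deterministic pseudo-embedding per string.
    seed = sum(ord(ch) for ch in text)
    return l2_normalize(np.random.default_rng(seed).normal(size=D))

def caption_image(image):
    # Stand-in for a frozen captioner; returns a fixed summary for any input.
    return "an aerial view of a harbor with docked boats"

def encode_image(image):
    # Stand-in for a frozen image encoder: pool pixel values into a D-dim vector.
    return l2_normalize(image.reshape(-1, D).mean(axis=0))

def condition_prompts(prompts, caption_feat, image_feat):
    # Assumed fusion: prompt tokens attend over the caption and image features.
    context = np.stack([caption_feat, image_feat])        # (2, D)
    attn = softmax(prompts @ context.T / np.sqrt(D))      # (N_PROMPT, 2)
    return prompts + attn @ context                       # (N_PROMPT, D)

def classify(image, prompts, class_names, logit_scale=100.0):
    image_feat = encode_image(image)
    caption_feat = embed_text(caption_image(image))
    conditioned = condition_prompts(prompts, caption_feat, image_feat)
    # Shift each class embedding by the pooled, image-conditioned prompt.
    class_feats = np.stack(
        [embed_text(f"a satellite photo of a {name}") for name in class_names]
    )
    class_feats = l2_normalize(class_feats + conditioned.mean(axis=0))
    logits = logit_scale * (class_feats @ image_feat)
    return softmax(logits), conditioned

rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32, 3))                  # dummy RS image
prompts = rng.normal(scale=0.02, size=(N_PROMPT, D))   # the only trainable parameters
class_names = ["harbor", "forest", "airport"]
probs, conditioned = classify(image, prompts, class_names)
print(dict(zip(class_names, probs.round(3))))
```

In this sketch only `prompts` would receive gradients during few-shot training; the captioner and both encoders are treated as fixed functions, which mirrors the abstract's description of a frozen captioning model and an unmodified core model.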

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications