Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing

Xu Zhang; Jiabin Fang; Zhuoming Ding; Jin Yuan; Xuan Liu; Qianjun Zhang; Zhiyong Li

arXiv:2512.11680·cs.CV·December 15, 2025

Open Access

TL;DR

This paper introduces CLV-Net, a novel model for multimodal remote sensing image understanding that uses visual cues and inter-object relationship modeling to improve segmentation and captioning accuracy.

Contribution

The paper proposes a cross-modal, context-aware learning framework with a novel decoder and alignment modules to enhance user-guided multimodal image understanding in remote sensing.

Findings

01 Outperforms existing methods on benchmark datasets

02 Achieves state-of-the-art segmentation and captioning results

03 Effectively captures user intent and inter-object relationships

Abstract

Recent advances in image understanding have enabled methods that leverage large language models for multimodal reasoning in remote sensing. However, existing approaches still struggle to steer models to the user-relevant regions when only simple, generic text prompts are available. Moreover, in large-scale aerial imagery many objects exhibit highly similar visual appearances and carry rich inter-object relationships, which further complicates accurate recognition. To address these challenges, we propose Cross-modal Context-aware Learning for Visual Prompt-Guided Multimodal Image Understanding (CLV-Net). CLV-Net lets users supply a simple visual cue, a bounding box, to indicate a region of interest, and uses that cue to guide the model to generate correlated segmentation masks and captions that faithfully reflect user intent. Central to our design is a Context-Aware Mask Decoder that…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications