Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images

Shuai Yang; Ziyue Huang; Jiaxin Chen; Qingjie Liu; Yunhong Wang

arXiv:2602.01954·cs.CV·February 3, 2026

Open Access

TL;DR

This paper introduces RS-MPOD, a multimodal framework for remote sensing object detection that enhances category specification through visual prompts and multimodal fusion, outperforming text-only methods, especially under semantic ambiguity.

Contribution

The paper proposes a multimodal prompting approach that incorporates visual prompts and fusion modules to address the limitations of text-only prompts in remote sensing object detection.

Findings

01. Visual prompting improves robustness under semantic ambiguity.

02. Multimodal prompting remains competitive with text-only prompting when textual semantics are well aligned.

03. Extensive experiments across multiple benchmarks validate the effectiveness of RS-MPOD.

Abstract

Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal…
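To make the idea of category specification from exemplars concrete, here is a minimal sketch in plain Python. It is not the RS-MPOD architecture: the abstract does not give the encoder or fusion details, so the sketch substitutes a generic prototype approach (averaging exemplar embeddings) and a simple convex combination for fusion. The function names, the `alpha` weight, the category names, and all vectors are hypothetical toy values chosen only for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def visual_prototype(exemplar_embeddings):
    """Average the normalized exemplar embeddings into one category prototype."""
    normed = [l2_normalize(e) for e in exemplar_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def fuse(text_emb, visual_emb, alpha=0.5):
    """Assumed fusion rule: convex combination; alpha weights the visual cue."""
    t = l2_normalize(text_emb)
    v = l2_normalize(visual_emb)
    return l2_normalize([(1 - alpha) * a + alpha * b for a, b in zip(t, v)])


def classify(region_emb, category_prompts):
    """Return the category whose prompt has the highest cosine similarity."""
    r = l2_normalize(region_emb)
    scores = {
        name: sum(a * b for a, b in zip(r, l2_normalize(p)))
        for name, p in category_prompts.items()
    }
    return max(scores, key=scores.get), scores


# Toy data: the text embedding for "dam" is deliberately misaligned with
# the region feature, mimicking a semantically ambiguous category name.
text_prompts = {"dam": [0.9, 0.3, 0.0], "bridge": [0.5, 0.7, 0.0]}
exemplars = {
    "dam": [[0.1, 1.0, 0.1], [0.3, 0.8, 0.2]],
    "bridge": [[0.8, 0.2, 0.5], [0.7, 0.1, 0.6]],
}
region = [0.2, 0.9, 0.1]  # feature of a candidate region that is truly a dam

visual_prompts = {k: visual_prototype(v) for k, v in exemplars.items()}
fused_prompts = {k: fuse(text_prompts[k], visual_prompts[k]) for k in text_prompts}

text_only_label, _ = classify(region, text_prompts)
visual_only_label, _ = classify(region, visual_prompts)
fused_label, _ = classify(region, fused_prompts)
print(text_only_label, visual_only_label, fused_label)
```

In this toy setup the text-only prompt picks the wrong category because its embedding is poorly aligned with the region feature, while the visual and fused prompts recover the intended one. That mirrors the failure mode the abstract describes, under the stated assumptions.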

Taxonomy

Topics: Remote-Sensing Image Classification · Multimodal Machine Learning Applications · Advanced Neural Network Applications