Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images
Shuai Yang, Ziyue Huang, Jiaxin Chen, Qingjie Liu, Yunhong Wang

TL;DR
This paper introduces RS-MPOD, a multimodal framework for remote sensing object detection that specifies categories through visual prompts and multimodal fusion, outperforming text-only methods, especially under semantic ambiguity.
Contribution
The paper proposes a multimodal prompting approach that incorporates visual prompts and fusion modules, addressing the limitations of text-only prompts in remote sensing object detection.
Findings
Visual prompting improves robustness under semantic ambiguity.
Multimodal prompting remains competitive with well-aligned textual semantics.
Experiments across multiple benchmarks support the effectiveness of the framework.
Abstract
Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal…
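The idea of specifying a category from exemplar instances rather than from text can be illustrated with a small sketch. This is not the RS-MPOD implementation: the abstract does not give the encoder or fusion design, so everything below is an assumption made for illustration. The function names (`visual_prototype`, `fuse_prompts`, `score_regions`), the mean-pooling of exemplar embeddings, the `alpha`-weighted fusion, and the cosine-similarity scoring are all hypothetical stand-ins.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def visual_prototype(exemplar_embeddings):
    """Hypothetical visual prompt: mean of the normalized exemplar embeddings."""
    if not exemplar_embeddings:
        raise ValueError("need at least one exemplar embedding")
    normalized = [l2_normalize(e) for e in exemplar_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def fuse_prompts(visual, text, alpha=0.5):
    """Hypothetical fusion: convex combination of visual and text prompts."""
    if len(visual) != len(text):
        raise ValueError("prompts must share one embedding dimension")
    v, t = l2_normalize(visual), l2_normalize(text)
    return l2_normalize([alpha * a + (1.0 - alpha) * b for a, b in zip(v, t)])


def score_regions(region_embeddings, prompt):
    """Cosine similarity between each candidate region and a category prompt."""
    p = l2_normalize(prompt)
    return [
        sum(a * b for a, b in zip(l2_normalize(r), p))
        for r in region_embeddings
    ]


# Text-free category specification from two exemplar crops (toy 2-D embeddings).
prototype = visual_prototype([[1.0, 0.0], [2.0, 0.0]])
scores = score_regions([[3.0, 0.0], [0.0, 4.0]], prototype)
```

In this toy setting the first region points in the same direction as the exemplars and scores highest, while the orthogonal region scores zero. Setting `alpha=1.0` in `fuse_prompts` reduces to the visual-only prompt and `alpha=0.0` to the text-only prompt, which mirrors the three prompting modes the abstract names (visual, textual, and their integration) without claiming anything about how the paper actually combines them.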
Taxonomy
Topics: Remote-Sensing Image Classification · Multimodal Machine Learning Applications · Advanced Neural Network Applications
