InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition
Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li, Xue Yang

TL;DR
InstructSAM is a training-free, instruction-driven framework for remote sensing object recognition. It uses large vision-language models to interpret instructions and identify objects efficiently, with no task-specific training.
Contribution
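The paper introduces InstructSAM, a training-free approach that interprets instructions for remote sensing object recognition. It also introduces the InstructCDS task suite (instruction-oriented counting, detection, and segmentation) for open-vocabulary, open-ended, and open-subclass scenarios, along with the EarthInstruct benchmark.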
Findings
InstructSAM matches or surpasses specialized baselines in multiple tasks.
It maintains near-constant inference time regardless of object count.
It reduces output tokens by 89% and runtime by over 32% compared to direct generation.
Abstract
Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, requiring models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
