Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval
J. Xiao, Y. Guo, X. Zi, K. Thiyagarajan, C. Moreira, M. Prasad

TL;DR
This paper introduces a training-free, text-to-text framework for remote sensing image retrieval that leverages structured captions and VLM-generated descriptions, achieving competitive results without model training.
Contribution
The paper presents RSRT, a new benchmark dataset, and TRSLLaVA, a novel zero-shot, training-free text-to-text retrieval method for remote sensing images.
Findings
Achieves nearly double the recall of baseline zero-shot methods.
Outperforms several supervised models on RSITMD.
Validates structured text as effective for semantic image retrieval.
Abstract
Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a…
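The text-to-text reformulation described in the abstract can be illustrated with a small sketch. This is not the TRSLLaVA pipeline: the excerpt does not say how the text is embedded or matched, so the example assumes a plain TF-IDF cosine similarity as a stand-in, and the captions, image ids, and function names are all hypothetical. It only shows the general idea of ranking images by comparing a text query against one caption per image, then scoring the ranking with recall@K.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(corpus_tokens):
    """Smoothed inverse document frequency computed over the caption corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(term for tokens in corpus_tokens for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(tokens, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def rank_images(query, captions):
    """Return image ids sorted by query-to-caption similarity, best first."""
    ids = list(captions)
    corpus_tokens = [tokenize(captions[i]) for i in ids]
    idf = build_idf(corpus_tokens)
    doc_vecs = [tfidf(tokens, idf) for tokens in corpus_tokens]
    q_vec = tfidf(tokenize(query), idf)
    scores = [cosine(q_vec, d) for d in doc_vecs]
    order = sorted(range(len(ids)), key=lambda k: scores[k], reverse=True)
    return [ids[k] for k in order]


def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant image is within the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


# Hypothetical captions standing in for VLM-generated image descriptions.
captions = {
    "img_airport": "An airport with two runways and several parked aircraft near a terminal.",
    "img_harbor": "A harbor with many boats docked along piers beside blue water.",
    "img_farmland": "Rectangular green farmland plots separated by narrow dirt roads.",
}

ranking = rank_images("boats docked at a pier in a harbor", captions)
print(ranking[0], recall_at_k(ranking, "img_harbor", k=1))
```

In this toy corpus the harbor caption shares the most distinctive terms with the query, so it ranks first. A real system would likely replace the TF-IDF step with a sentence-embedding model, but the ranking and recall@K logic would stay the same.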
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
