Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing

Xi Chen; Xu Chen; Xiangyang Jia; Xu Zhang; Shuquan Wei; Wei Wang

arXiv:2604.20429·cs.CV·April 23, 2026

TL;DR

This paper introduces a two-stage framework for remote sensing image-text retrieval that balances efficiency and fine-grained alignment using multi-granular representations and a novel reranking approach.

Contribution

The proposed fast-then-fine (FTF) framework decomposes retrieval into a text-agnostic recall stage for fast candidate selection and a text-guided rerank stage for fine-grained alignment, improving efficiency while keeping accuracy competitive.

Findings

01. Achieves competitive accuracy on public benchmarks.

02. Significantly improves retrieval efficiency over existing methods.

03. Effectively balances coarse and fine-grained cross-modal alignment.

Abstract

Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage,…
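The recall-then-rerank decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine-similarity recall, and the token-to-region max-similarity rerank score are all assumptions chosen for illustration, and the embeddings are assumed to be precomputed arrays. It only shows the general pattern of narrowing the gallery with cheap global vectors and then rescoring the short candidate list with a finer, query-dependent comparison.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def recall_stage(text_global, image_globals, k):
    # Coarse stage (assumed form): rank every image by cosine similarity between
    # one global text vector and one global vector per image, keep the top k.
    sims = l2_normalize(image_globals) @ l2_normalize(text_global)
    k = min(k, sims.shape[0])
    top = np.argpartition(-sims, k - 1)[:k]   # unordered top-k indices
    return top[np.argsort(-sims[top])]        # ordered by descending similarity

def rerank_stage(text_tokens, image_regions, candidates):
    # Fine stage (hypothetical score, not the paper's): for each candidate image,
    # match every text token to its most similar region and average the matches.
    t = l2_normalize(text_tokens)                     # (T, d)
    scores = []
    for idx in candidates:
        r = l2_normalize(image_regions[idx])          # (R, d)
        scores.append((t @ r.T).max(axis=1).mean())
    scores = np.asarray(scores)
    order = np.argsort(-scores)
    return candidates[order], scores[order]

def fast_then_fine(text_global, text_tokens, image_globals, image_regions, k=10):
    # Only the k recalled candidates pay the cost of the fine-grained comparison.
    candidates = recall_stage(text_global, image_globals, k)
    return rerank_stage(text_tokens, image_regions, candidates)
```

In this sketch the per-query cost of the fine stage grows with k rather than with the gallery size, which is the efficiency argument behind a two-stage design; how the paper actually builds its coarse and fine representations is not recoverable from the truncated abstract.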
