ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery
Ke Li, Ting Wang, Di Wang, Yongshan Zhu, Yiming Zhang, Tao Lei, Quan Wang

TL;DR
ProVG is a progressive visual grounding framework for remote sensing imagery. It decouples language expressions into fine-grained linguistic cues and applies them in a coarse-to-fine alignment process to improve localization accuracy.
Contribution
The paper decouples language expressions into global context, spatial relations, and object attributes, and proposes a progressive cross-modal modulation scheme tailored to remote sensing visual grounding.
Findings
ProVG achieves state-of-the-art results on the RRSIS-D and RISBench benchmarks.
The framework effectively utilizes spatial relations and object attributes for improved localization.
Extensive experiments validate the superiority of the proposed method over existing approaches.
Abstract
Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as spatial relations and object attributes, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose ProVG, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual…
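To make the idea of decoupling and progressive modulation concrete, here is a minimal toy sketch. It is not the authors' implementation: the keyword lists, the `decouple`, `embed`, `modulate`, and `progressive_modulation` helpers, the scale-and-shift (FiLM-style) modulation, and the stage order (global context, then spatial relations, then object attributes) are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List

# Hypothetical keyword lists; a real system would use a parser or a learned model.
SPATIAL_WORDS = {"left", "right", "top", "bottom", "upper", "lower",
                 "near", "beside", "above", "below", "middle", "center"}
ATTRIBUTE_WORDS = {"large", "small", "red", "white", "blue", "green",
                   "gray", "round", "rectangular", "long"}


def decouple(expression: str) -> Dict[str, List[str]]:
    """Split an expression into global, spatial, and attribute cues (toy rule-based version)."""
    tokens = expression.lower().replace(",", " ").split()
    return {
        "global": tokens,
        "spatial": [t for t in tokens if t in SPATIAL_WORDS],
        "attribute": [t for t in tokens if t in ATTRIBUTE_WORDS],
    }


def embed(tokens: List[str], dim: int = 4) -> List[float]:
    """Toy deterministic embedding; stands in for a learned text encoder."""
    vec = [0.0] * dim
    for token in tokens:
        for i, ch in enumerate(token):
            vec[i % dim] += (ord(ch) % 7) / 10.0
    return vec


def modulate(visual: List[float], cue: List[float]) -> List[float]:
    """Scale-and-shift modulation (assumed form): a zero cue leaves the feature unchanged."""
    return [v * (1.0 + math.tanh(c)) + 0.1 * c for v, c in zip(visual, cue)]


def progressive_modulation(visual: List[float],
                           cues: Dict[str, List[str]]) -> List[List[float]]:
    """Apply the cues one stage at a time, coarse to fine (assumed order)."""
    stages = []
    feature = visual
    for name in ("global", "spatial", "attribute"):
        feature = modulate(feature, embed(cues[name], dim=len(visual)))
        stages.append(feature)
    return stages


cues = decouple("the small white storage tank on the left")
stages = progressive_modulation([1.0, 1.0, 1.0, 1.0], cues)
```

In this sketch each stage refines the output of the previous one, so later cues act on features already conditioned on earlier cues. The actual modulator and its stage design may differ from this assumed form.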
