DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models
Zhe Dong, Yuzhe Sun, Tianzhu Liu, Yanfeng Gu

TL;DR
DiffRIS uses pre-trained text-to-image diffusion models, together with a context perception adapter and a progressive reasoning decoder, to improve the accuracy of referring remote sensing image segmentation under challenges such as scale and orientation variations.
Contribution
The paper introduces DiffRIS, a novel framework that uses diffusion models for enhanced cross-modal alignment in remote sensing segmentation, with a context perception adapter and a progressive reasoning decoder.
Findings
Outperforms existing methods on three benchmark datasets.
Achieves new state-of-the-art results on referring remote sensing image segmentation (RRSIS) tasks.
Demonstrates the effectiveness of diffusion models in remote sensing applications.
Abstract
Referring remote sensing image segmentation (RRSIS) enables the precise delineation of regions within remote sensing imagery through natural language descriptions, serving critical applications in disaster response, urban development, and environmental monitoring. Despite recent advances, current approaches face significant challenges in processing aerial imagery due to complex object characteristics including scale variations, diverse orientations, and semantic ambiguities inherent to the overhead perspective. To address these limitations, we propose DiffRIS, a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for enhanced cross-modal alignment in RRSIS tasks. Our framework introduces two key innovations: a context perception adapter (CP-adapter) that dynamically refines linguistic features through global context…
