DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models

Zhe Dong; Yuzhe Sun; Tianzhu Liu; Yanfeng Gu

arXiv:2506.18946 · cs.CV · June 25, 2025

TL;DR

DiffRIS uses pre-trained text-to-image diffusion models, together with a context perception adapter and a progressive reasoning decoder, to improve the accuracy of referring remote sensing image segmentation, targeting the scale and orientation variations typical of aerial imagery.

Contribution

The paper introduces DiffRIS, a framework that uses pre-trained text-to-image diffusion models to improve cross-modal alignment in referring remote sensing image segmentation (RRSIS). It adds two components: a context perception adapter (CP-adapter) and a progressive reasoning decoder.

Findings

01. Outperforms existing methods on three benchmark datasets.

02. Achieves new state-of-the-art results in RRSIS tasks.

03. Demonstrates the effectiveness of diffusion models in remote sensing applications.

Abstract

Referring remote sensing image segmentation (RRSIS) enables the precise delineation of regions within remote sensing imagery through natural language descriptions, serving critical applications in disaster response, urban development, and environmental monitoring. Despite recent advances, current approaches face significant challenges in processing aerial imagery due to complex object characteristics including scale variations, diverse orientations, and semantic ambiguities inherent to the overhead perspective. To address these limitations, we propose DiffRIS, a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for enhanced cross-modal alignment in RRSIS tasks. Our framework introduces two key innovations: a context perception adapter (CP-adapter) that dynamically refines linguistic features through global context…
