TL;DR
RSRefSeg 2 is a decoupled, two-stage framework that combines the CLIP and SAM foundation models for referring remote sensing image segmentation, improving segmentation accuracy and semantic understanding.
Contribution
The paper proposes a decoupling paradigm that separates target localization from segmentation and integrates foundation models to improve cross-modal alignment and interpretability.
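The separation of localization from segmentation can be illustrated with a toy sketch. This is not the authors' implementation: `coarse_localize` and `refine_mask` are hypothetical names, the cosine-similarity map merely stands in for a CLIP-style text-image alignment step, and the intensity threshold merely stands in for a promptable segmenter such as SAM. The feature grid is also assumed to have the same resolution as the image, which is a simplification.

```python
import numpy as np

def coarse_localize(patch_feats, text_feat, threshold=0.5):
    """Stage 1 (hypothetical stand-in): text-to-patch cosine similarity -> box prompt.

    patch_feats: (H, W, D) array of per-location visual features.
    text_feat:   (D,) array for the referring expression.
    Returns (x0, y0, x1, y1) in grid coordinates, or None if nothing matches.
    """
    eps = 1e-8  # avoids division by zero for all-zero features
    p = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + eps)
    t = text_feat / (np.linalg.norm(text_feat) + eps)
    sim = p @ t  # (H, W) similarity map
    ys, xs = np.nonzero(sim > threshold)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

def refine_mask(image, box):
    """Stage 2 (hypothetical stand-in): delineate the target inside the box.

    A real system would pass the box to a promptable segmenter; here we
    simply threshold pixel intensity against the mean inside the box.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros(image.shape[:2], dtype=bool)
    region = image[y0:y1 + 1, x0:x1 + 1]
    mask[y0:y1 + 1, x0:x1 + 1] = region > region.mean()
    return mask

# Toy data: a 4x4 grid whose central 2x2 block matches the text feature.
patch_feats = np.zeros((4, 4, 2))
patch_feats[..., 1] = 1.0            # background features point along axis 1
patch_feats[1:3, 1:3] = [1.0, 0.0]   # target features point along axis 0
text_feat = np.array([1.0, 0.0])

image = np.zeros((4, 4))
image[1:3, 1:3] = [[0.9, 0.1], [0.8, 0.2]]

box = coarse_localize(patch_feats, text_feat)
mask = refine_mask(image, box)
```

Because the two stages communicate only through the box prompt, a wrong mask can be traced to either a misplaced box or a poor delineation inside a correct box, which is the kind of interpretability a decoupled design is meant to offer.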
Findings
Outperforms existing methods by approximately 3% gIoU in segmentation accuracy.
Effectively handles complex semantic relationships in remote sensing images.
Demonstrates superior generalizability and interpretability in experiments.
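For readers unfamiliar with the metric in the first finding, the sketch below computes gIoU under the assumption that it denotes the mean of per-sample IoU scores, as the term is commonly used in referring segmentation; the summary itself does not define it. The function names and the convention of scoring two empty masks as 1.0 are illustrative choices, not taken from the paper.

```python
import numpy as np

def per_sample_iou(pred, target):
    """IoU between one predicted binary mask and its ground-truth mask."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    inter = np.logical_and(pred, target).sum()
    return float(inter / union)

def giou(preds, targets):
    """Assumed definition: average of per-sample IoU over the dataset."""
    return float(np.mean([per_sample_iou(p, t) for p, t in zip(preds, targets)]))

# Two toy 2x2 samples: the first overlaps on 1 of 2 pixels, the second exactly.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
score = giou(preds, targets)  # (0.5 + 1.0) / 2 = 0.75
```

Under this definition every sample contributes equally regardless of target size, so small objects weigh as much as large ones in the reported score.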
Abstract
Referring Remote Sensing Image Segmentation provides a flexible and fine-grained framework for remote sensing scene analysis via vision-language collaborative interpretation. Current approaches predominantly utilize a three-stage pipeline encompassing dual-modal encoding, cross-modal interaction, and pixel decoding. These methods demonstrate significant limitations in managing complex semantic relationships and achieving precise cross-modal alignment, largely due to their coupled processing mechanism that conflates target localization with boundary delineation. This architectural coupling amplifies error propagation under semantic ambiguity while restricting model generalizability and interpretability. To address these issues, we propose RSRefSeg 2, a decoupling paradigm that reformulates the conventional workflow into a collaborative dual-stage framework: coarse localization followed…
Taxonomy
Methods: Segment Anything Model · Contrastive Language-Image Pre-training
