TL;DR
RSRefSeg introduces a foundation model for referring remote sensing image segmentation that combines CLIP and SAM to improve fine-grained visual understanding and segmentation accuracy in remote sensing applications.
Contribution
It presents a novel framework that leverages CLIP and SAM for better multimodal alignment and segmentation in remote sensing images, addressing limitations of previous methods.
Findings
Outperforms existing methods on the RRSIS-D dataset
Effectively aligns fine-grained semantic concepts across modalities
Enhances segmentation accuracy in remote sensing images
Abstract
Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding through free-format textual input, enabling enhanced scene and object extraction in remote sensing applications. Current research primarily utilizes pre-trained language models to encode textual descriptions and align them with visual modalities, thereby facilitating the expression of relevant visual features. However, these approaches often struggle to establish robust alignments between fine-grained semantic concepts, leading to inconsistent representations across textual and visual information. To address these limitations, we introduce a referring remote sensing image segmentation foundational model, RSRefSeg. RSRefSeg leverages CLIP for visual and textual encoding, employing both global and local textual semantics as filters to generate referring-related visual activation features…
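The filtering idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that "filtering" means taking the cosine similarity between each text embedding (one global sentence vector plus local token vectors) and each visual patch feature. The function names `cosine` and `activation_maps` and the toy 2-dimensional features are hypothetical. Real CLIP features would be much higher-dimensional, and the abstract is truncated before it says how the activation features are used downstream.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def activation_maps(patch_features, text_filters):
    """Apply each text embedding as a filter over a grid of patch features.

    patch_features: H x W grid (nested lists) of D-dimensional vectors.
    text_filters:   list of D-dimensional vectors (assumed here to be one
                    global sentence embedding followed by local token embeddings).
    Returns one H x W similarity map per filter.
    """
    return [
        [[cosine(patch, text_vec) for patch in row] for row in patch_features]
        for text_vec in text_filters
    ]

# Toy stand-ins for encoder outputs (hypothetical values, not real CLIP features).
patch_features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 1.0]],
]
global_filter = [1.0, 0.0]    # stands in for a sentence-level embedding
local_filters = [[0.0, 1.0]]  # stands in for word- or phrase-level embeddings

maps = activation_maps(patch_features, [global_filter] + local_filters)
for name, activation in zip(["global", "local[0]"], maps):
    print(name, activation)
```

Each resulting map is high where a patch agrees with the corresponding text vector, which is one plausible reading of "referring-related visual activation features".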
Taxonomy
Methods: ALIGN · Contrastive Language-Image Pre-training · Segment Anything Model
