Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition
Mohammadreza Heidarianbaei, Mareike Dorozynski, Hubert Kanyamahanga, Max Mehltretter, Franz Rottensteiner

TL;DR
This paper introduces ReSeg-CLIP, a training-free open-vocabulary semantic segmentation method for remote sensing that combines hierarchical attention masking with model composition.
Contribution
It presents a novel hierarchical masking scheme and a model composition approach to enhance CLIP-based segmentation in remote sensing without additional training.
Findings
Achieves state-of-the-art results on three remote sensing benchmarks.
Effectively constrains self-attention interactions using SAM masks.
Utilizes a new weighting scheme for model averaging based on text prompt quality.
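To make the idea of constraining self-attention with segment masks concrete, here is a minimal sketch in plain Python. It assumes each patch token already carries a segment label derived from a SAM mask. The names `segment_attention_mask`, `masked_attention_weights`, `coarse_ids`, and `fine_ids` are hypothetical, and the two label sets only illustrate that masks at different scales permit different interactions. How ReSeg-CLIP actually assigns scales to layers is not described here.

```python
import math

def segment_attention_mask(segment_ids):
    """allow[i][j] is True when tokens i and j carry the same segment label."""
    return [[a == b for b in segment_ids] for a in segment_ids]

def masked_attention_weights(scores, allow):
    """Row-wise softmax over raw scores, restricted to allowed positions."""
    weights = []
    for row_scores, row_allow in zip(scores, allow):
        kept = [s for s, ok in zip(row_scores, row_allow) if ok]
        peak = max(kept)  # non-empty: every token shares a segment with itself
        exps = [math.exp(s - peak) if ok else 0.0
                for s, ok in zip(row_scores, row_allow)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical toy input: 4 patch tokens with labels at two scales.
coarse_ids = [0, 0, 1, 1]   # assumed coarse-scale segment labels
fine_ids = [0, 1, 2, 2]     # assumed fine-scale segment labels
scores = [[0.0] * 4 for _ in range(4)]  # uniform raw attention scores

coarse_w = masked_attention_weights(scores, segment_attention_mask(coarse_ids))
fine_w = masked_attention_weights(scores, segment_attention_mask(fine_ids))
```

With uniform scores, token 0 spreads its attention over tokens 0 and 1 under the coarse labels but attends only to itself under the fine labels, so the mask alone determines which tokens may interact.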
Abstract
In this paper, we propose ReSeg-CLIP, a new training-free Open-Vocabulary Semantic Segmentation method for remote sensing (RS) data. To compensate for the problems that vision-language models such as CLIP exhibit in semantic segmentation, which are caused by inappropriate interactions within the self-attention layers, we introduce a hierarchical scheme that uses masks generated by SAM to constrain the interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, using a new weighting scheme that evaluates representational quality with varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.
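The model composition step can be pictured as a weighted average of parameters. The sketch below is a minimal illustration, assuming every variant shares the same parameter names and shapes. The function `compose_models`, the toy parameter dictionaries, and the `quality_scores` values are all hypothetical. The abstract does not say how the prompt-based quality score is computed, so it is treated here as a given input.

```python
def compose_models(state_dicts, quality_scores):
    """Average parameters across models, weighted by normalized quality scores."""
    total = sum(quality_scores)
    if total <= 0:
        raise ValueError("quality scores must sum to a positive value")
    weights = [s / total for s in quality_scores]

    composed = {}
    for name, reference in state_dicts[0].items():
        # Each parameter is a flat list of floats in this toy setting.
        composed[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(reference))
        ]
    return composed

# Hypothetical toy "models" with a single two-element parameter each.
model_a = {"proj": [1.0, 2.0]}
model_b = {"proj": [3.0, 6.0]}
quality_scores = [1.0, 3.0]  # assumed scores; normalized to 0.25 and 0.75

merged = compose_models([model_a, model_b], quality_scores)
```

Because the weights are normalized to sum to one, each composed parameter stays within the range spanned by the individual variants, and a variant with a higher score pulls the result closer to its own values.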
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Remote-Sensing Image Classification
