Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition

Mohammadreza Heidarianbaei; Mareike Dorozynski; Hubert Kanyamahanga; Max Mehltretter; Franz Rottensteiner

arXiv:2602.23869·cs.CV·March 2, 2026

TL;DR

This paper introduces ReSeg-CLIP, a training-free open-vocabulary semantic segmentation method for remote sensing that combines hierarchical attention masking with model composition.

Contribution

The paper presents a hierarchical attention-masking scheme and a model composition approach that improve CLIP-based segmentation of remote sensing imagery without additional training.

Findings

01. Achieves state-of-the-art results on three remote sensing benchmarks.
02. Effectively constrains self-attention interactions using SAM masks.
03. Utilizes a new weighting scheme for model averaging based on text prompt quality.
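
The model-averaging idea in the third finding can be illustrated with a minimal sketch. It assumes each CLIP variant is represented as a plain dictionary of parameter lists and that a per-model quality score is already available. How the paper derives those scores from text prompts is not described on this page, so the scores, the softmax normalization, and the names `softmax` and `average_parameters` are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn per-model quality scores into weights that sum to 1.
    Assumption: softmax normalization; the paper's scheme may differ."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_parameters(state_dicts, weights):
    """Weighted average of parameters that share names and shapes.
    Each state dict maps a parameter name to a flat list of floats."""
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    merged = {}
    for name in state_dicts[0]:
        tensors = [sd[name] for sd in state_dicts]
        size = len(tensors[0])
        merged[name] = [
            sum(w * t[i] for w, t in zip(weights, tensors))
            for i in range(size)
        ]
    return merged

# Hypothetical toy example: two "models" with one parameter each.
model_a = {"w": [0.0, 2.0]}
model_b = {"w": [4.0, 6.0]}
weights = softmax([1.0, 1.0])  # equal scores give equal weights
merged = average_parameters([model_a, model_b], weights)
```

With equal scores the result is the plain mean of the two parameter sets; a higher score for one model would pull the merged parameters toward that model.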

Abstract

In this paper, we propose ReSeg-CLIP, a new training-free Open-Vocabulary Semantic Segmentation method for remote sensing (RS) data. To compensate for the problems of vision-language models such as CLIP in semantic segmentation, which are caused by inappropriate interactions within the self-attention layers, we introduce a hierarchical scheme that uses masks generated by SAM to constrain these interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, taking advantage of a new weighting scheme that evaluates representational quality using varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.
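
To make the mask-constrained attention idea concrete, here is a minimal sketch in plain Python. It assumes every image token carries a segment id (standing in for a SAM mask) and that a token may attend only to tokens in the same segment. The function names, the toy segment ids, and the way the two scales are represented are hypothetical; the abstract does not specify how the hierarchy is applied across layers.

```python
import math

def build_attention_mask(segment_ids):
    """allowed[i][j] is True when tokens i and j share a segment."""
    n = len(segment_ids)
    return [[segment_ids[i] == segment_ids[j] for j in range(n)]
            for i in range(n)]

def masked_attention_weights(scores, allowed):
    """Row-wise softmax over attention scores, restricted to allowed pairs.
    Disallowed pairs receive a weight of exactly 0."""
    weights = []
    for row, allow_row in zip(scores, allowed):
        # Each token shares a segment with itself, so `kept` is never empty.
        kept = [s for s, a in zip(row, allow_row) if a]
        m = max(kept)
        exps = [math.exp(s - m) if a else 0.0
                for s, a in zip(row, allow_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical two-scale hierarchy over three tokens:
# the coarse map groups tokens 0 and 1, the fine map separates all tokens.
coarse_ids = [0, 0, 1]
fine_ids = [0, 1, 2]
masks_per_scale = [build_attention_mask(ids) for ids in (coarse_ids, fine_ids)]

scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform raw scores
coarse_weights = masked_attention_weights(scores, masks_per_scale[0])
fine_weights = masked_attention_weights(scores, masks_per_scale[1])
```

At the coarse scale, tokens 0 and 1 split their attention evenly and ignore token 2; at the fine scale, each token attends only to itself.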

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Remote-Sensing Image Classification