DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation
Zhen Yao, Xin Li, Taotao Jing, Shuai Zhang, Mooi Choo Chuah

TL;DR
DiSa introduces a saliency-aware framework for open-vocabulary semantic segmentation that disentangles foreground and background features, improving boundary localization and reducing foreground bias.
Contribution
The paper proposes a novel saliency-aware disentanglement framework with hierarchical refinement to address foreground bias and spatial localization issues in VLM-based segmentation.
Findings
Outperforms state-of-the-art on six benchmarks
Effectively models foreground and background separately
Improves boundary accuracy and reduces foreground bias
Abstract
Open-vocabulary semantic segmentation aims to assign labels to every pixel in an image based on text labels. Existing approaches typically utilize vision-language models (VLMs), such as CLIP, for dense prediction. However, VLMs, pre-trained on image-text pairs, are biased toward salient, object-centric regions and exhibit two critical limitations when adapted to segmentation: (i) Foreground Bias, which tends to ignore background regions, and (ii) Limited Spatial Localization, resulting in blurred object boundaries. To address these limitations, we introduce DiSa, a novel saliency-aware foreground-background disentangled framework. By explicitly incorporating saliency cues in our designed Saliency-aware Disentanglement Module (SDM), DiSa separately models foreground and background ensemble features in a divide-and-conquer manner. Additionally, we propose a Hierarchical Refinement Module…
Peer Reviews
Decision·Submitted to ICLR 2026
- Both the SDM and the HRM are shown to effectively mitigate the issues they target, and the ablation studies provide strong motivation for each element of the method.
- Good experimental results; the improvement on PAS-20b indicates that the results are supported by the theory.
- Good performance with no additional datasets and low computational cost (GFLOPs).
- The entire disentanglement pipeline depends on the quality and accuracy of the ITM loss. If it fails to localize the object for novel classes, the split will be flawed.
- While GFLOPs are low, the pipeline is complex, which can introduce higher training overhead and fragility (more failure points).
- k=96 is a fixed value. This does not account for objects that vary widely in size or are only partially visible, which can impact performance on diverse scenes.
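The concern about a fixed k can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that token selection means taking the k most salient patch tokens as foreground and the rest as background, and the function names `split_tokens_by_saliency` and `foreground_recall`, the 576-token grid, and the toy binary saliency maps are all hypothetical choices made for illustration.

```python
# Illustrative sketch only; assumes foreground = top-k tokens by saliency.
# All names and sizes here are hypothetical, not taken from the paper.

def split_tokens_by_saliency(saliency, k):
    """Return (foreground_idx, background_idx) for a flat list of saliency scores."""
    k = max(0, min(k, len(saliency)))  # clamp k to a valid range
    # Sort token indices from most to least salient (stable for ties).
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(order[:k]), sorted(order[k:])


def foreground_recall(fg_idx, object_idx):
    """Fraction of true object tokens that ended up in the foreground set."""
    obj = set(object_idx)
    if not obj:
        return 1.0
    return len(obj & set(fg_idx)) / len(obj)


def foreground_precision(fg_idx, object_idx):
    """Fraction of selected foreground tokens that really belong to the object."""
    if not fg_idx:
        return 1.0
    return len(set(object_idx) & set(fg_idx)) / len(fg_idx)


def toy_saliency(num_tokens, object_idx):
    """Binary saliency map: 1.0 on object tokens, 0.0 elsewhere (idealized)."""
    obj = set(object_idx)
    return [1.0 if i in obj else 0.0 for i in range(num_tokens)]


def compare_object_sizes(num_tokens=576, k=96):
    """Apply the same fixed k to a small and a large object."""
    results = {}
    for name, size in (("small", 20), ("large", 300)):
        object_idx = list(range(size))
        fg, _ = split_tokens_by_saliency(toy_saliency(num_tokens, object_idx), k)
        results[name] = (
            foreground_recall(fg, object_idx),
            foreground_precision(fg, object_idx),
        )
    return results
```

Even with a perfect saliency map, a fixed k pads the foreground set with background tokens when the object is small, and pushes object tokens into the background set when the object is large. A size-dependent k or a saliency threshold are possible alternatives, though the review does not say which, if any, the authors considered.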
1. The fundamental problem is well-identified and critical: The authors pinpoint the crucial issue of Vision-Language Models (VLMs) being inherently biased toward foreground objects in dense prediction tasks like semantic segmentation. Their proposed saliency map-based method provides an elegant and direct solution for decoupling foreground and background representations.
2. The empirical results are well-validated: The model not only achieves State-of-the-Art (SOTA) performance but demonstrate
1. The proposed module for extracting foreground/background regions is based on GradCAM. While the GFLOP count remains comparable, reliance on a gradient-based method like GradCAM may slow inference because of the required backward pass.
1. The paper is clearly written.
2. The idea is novel and makes sense to me.
3. The modular design can easily slot into CLIP-based baselines.
1. The saliency is derived from cross-attention + Grad-CAM reweighting via an auxiliary ITM loss tied to segmentation supervision. It may introduce label leakage.
2. The performance gains shown in Table 1 are limited.
3. The foreground/background token selection is somewhat ambiguous and would benefit from further elaboration. Is it also possible to use the model for foreground/background segmentation?
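As a rough illustration of what "cross-attention + Grad-CAM reweighting" typically means, the sketch below weights each attention value by the positive part of its gradient and averages over heads. This is a common Grad-CAM-style recipe for attention maps, assumed here for illustration; it is not confirmed to be the exact formulation used in DiSa. The gradients would normally come from a backward pass through the ITM loss, but here they are passed in as plain lists, and the names `gradcam_reweight` and `min_max_normalize` are hypothetical.

```python
# Illustrative sketch only; the paper's exact saliency formula may differ.
# attn[h][p]  : cross-attention weight of head h on image patch p
# grads[h][p] : gradient of a scalar loss (e.g. an ITM loss) w.r.t. attn[h][p],
#               assumed to be precomputed by a backward pass.

def gradcam_reweight(attn, grads):
    """Per-patch saliency: mean over heads of relu(gradient) * attention."""
    if len(attn) != len(grads) or not attn:
        raise ValueError("attn and grads must have the same non-zero number of heads")
    num_heads = len(attn)
    num_patches = len(attn[0])
    saliency = [0.0] * num_patches
    for a_head, g_head in zip(attn, grads):
        if len(a_head) != num_patches or len(g_head) != num_patches:
            raise ValueError("every head must cover the same number of patches")
        for p in range(num_patches):
            # Keep only positive gradients, as in Grad-CAM.
            saliency[p] += max(g_head[p], 0.0) * a_head[p]
    return [s / num_heads for s in saliency]


def min_max_normalize(values):
    """Rescale scores to [0, 1]; a constant map becomes all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]
```

Because the gradient term requires a backward pass, a scheme of this kind adds cost at inference time that a GFLOP count of the forward pass alone may not reflect, which is the point raised in the GradCAM comment above.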
Taxonomy
Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Advanced Neural Network Applications
