DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation

Zhen Yao; Xin Li; Taotao Jing; Shuai Zhang; Mooi Choo Chuah

arXiv:2601.20064·cs.CV·January 29, 2026

Open Access · 3 Reviews

TL;DR

DiSa introduces a saliency-aware framework for open-vocabulary semantic segmentation that effectively disentangles foreground and background features, leading to improved boundary localization and reduced bias.

Contribution

The paper proposes a novel saliency-aware disentanglement framework with hierarchical refinement to address foreground bias and spatial localization issues in VLM-based segmentation.

Findings

01 Outperforms state-of-the-art on six benchmarks

02 Effectively models foreground and background separately

03 Improves boundary accuracy and reduces bias

Abstract

Open-vocabulary semantic segmentation aims to assign labels to every pixel in an image based on text labels. Existing approaches typically utilize vision-language models (VLMs), such as CLIP, for dense prediction. However, VLMs, pre-trained on image-text pairs, are biased toward salient, object-centric regions and exhibit two critical limitations when adapted to segmentation: (i) Foreground Bias, which tends to ignore background regions, and (ii) Limited Spatial Localization, resulting in blurred object boundaries. To address these limitations, we introduce DiSa, a novel saliency-aware foreground-background disentangled framework. By explicitly incorporating saliency cues in our designed Saliency-aware Disentanglement Module (SDM), DiSa separately models foreground and background ensemble features in a divide-and-conquer manner. Additionally, we propose a Hierarchical Refinement Module…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Both the SDM and HRM are shown to effectively mitigate the issues they target, and the ablation studies provide strong motivation for each element of the method.
- Good experimental results, and the increase on PAS-20b indicates that the results are supported by the theory.
- Good performance with no additional datasets and low computational cost (GFLOPs).

Weaknesses

- The entire disentanglement pipeline depends on the quality and accuracy of the ITM loss. If it fails to localize the object for novel classes, the foreground/background split will be flawed.
- While GFLOPs are low, the pipeline is complex, which can introduce higher training overhead and fragility (more failure points).
- k=96 is a fixed value. This does not account for objects that vary widely in size or are only partially visible, which can hurt performance on diverse scenes.
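The fixed-k concern in the last point can be illustrated with a small sketch. It assumes, hypothetically, that token selection ranks patch tokens by saliency, takes the k most salient as foreground, and treats the rest as background. The function name `split_tokens_topk`, the 196-token grid, and the saliency values are illustrative and not taken from the paper; only k=96 comes from the review.

```python
def split_tokens_topk(saliency, k):
    """Return (foreground_idx, background_idx) by ranking tokens on saliency.

    Hypothetical selection rule: the k most salient tokens are foreground,
    all remaining tokens are background.
    """
    if not 0 <= k <= len(saliency):
        raise ValueError("k must be between 0 and the number of tokens")
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(order[:k]), sorted(order[k:])


# Toy scene (assumed values): 196 patch tokens, only 10 of them on the object.
num_tokens = 196
object_tokens = set(range(10))
saliency = [0.9 if i in object_tokens else 0.1 for i in range(num_tokens)]

fg, bg = split_tokens_topk(saliency, k=96)

# With a fixed k, the foreground set always holds k tokens, even though
# only 10 tokens are actually salient in this scene.
non_object_in_fg = [i for i in fg if i not in object_tokens]
print(len(fg), len(bg), len(non_object_in_fg))  # prints: 96 100 86
```

In this toy case 86 of the 96 "foreground" tokens lie off the object, which is the failure mode the reviewer describes for small or partially visible objects.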

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The fundamental problem is well-identified and critical: The authors pinpoint the crucial issue of Vision-Language Models (VLMs) being inherently biased toward foreground objects in dense prediction tasks like semantic segmentation. Their proposed saliency map-based method provides an elegant and direct solution for decoupling foreground and background representations.
2. The empirical results are well-validated: The model not only achieves State-of-the-Art (SOTA) performance but demonstrate

Weaknesses

1. The proposed module for extracting foreground/background regions is based on GradCAM. Although the GFLOP count remains comparable, relying on a gradient-based method like GradCAM may slow inference because of the required backward pass.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper is clearly written.
2. The idea is novel and makes sense to me.
3. The modular design can easily slot into CLIP-based baselines.

Weaknesses

1. The saliency is derived from cross-attention with Grad-CAM reweighting, via an auxiliary ITM loss tied to segmentation supervision. This may introduce label leakage.
2. The performance gains shown in Table 1 are limited.
3. The foreground/background Token Selection is somewhat ambiguous and should be explained in more detail. Could the model also be used for foreground/background segmentation?
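The saliency computation mentioned in point 1 can be sketched as follows. This assumes the common Grad-CAM formulation for attention maps: each head's cross-attention weights are multiplied by the gradient of the ITM objective with respect to those weights, negative values are clamped to zero, and the result is averaged over heads. Whether DiSa uses exactly this form is not stated in the review. The function name `gradcam_saliency` and the toy inputs are hypothetical, and the gradients are supplied as plain lists rather than computed by a backward pass.

```python
def gradcam_saliency(attn, grads):
    """Per-token saliency from per-head attention weights and their gradients.

    attn[h][i]  : cross-attention weight of head h on image token i
    grads[h][i] : gradient of the ITM objective w.r.t. attn[h][i]
    Assumed rule: saliency[i] = mean over heads of max(0, grads * attn).
    """
    if len(attn) != len(grads) or not attn:
        raise ValueError("attn and grads need the same non-zero number of heads")
    num_heads, num_tokens = len(attn), len(attn[0])
    saliency = [0.0] * num_tokens
    for a_row, g_row in zip(attn, grads):
        if len(a_row) != num_tokens or len(g_row) != num_tokens:
            raise ValueError("all heads must cover the same tokens")
        for i, (a, g) in enumerate(zip(a_row, g_row)):
            # Clamp negative contributions to zero, then average over heads.
            saliency[i] += max(0.0, a * g) / num_heads
    return saliency


# Two heads, four image tokens (toy values, not from the paper).
attn = [[0.5, 0.25, 0.25, 0.0],
        [0.25, 0.5, 0.0, 0.25]]
grads = [[2.0, -1.0, 0.0, 4.0],
         [2.0, 1.0, 3.0, -2.0]]

saliency = gradcam_saliency(attn, grads)
print(saliency)  # prints: [0.75, 0.25, 0.0, 0.0]
```

Token 0 receives the highest saliency because both heads attend to it with a positive gradient, while tokens with zero attention or negative gradient contribute nothing.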

Taxonomy

Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Advanced Neural Network Applications