DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

Mohamad Zamini; Diksha Shukla

arXiv:2604.24997 · cs.CV · April 29, 2026

TL;DR

DouC is a training-free dual-branch CLIP framework for open-vocabulary segmentation that enhances local token reliability and spatial coherence without retraining.

Contribution

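DouC splits dense prediction across two complementary branches: OG-CLIP, which applies lightweight inference-time token gating to improve patch-level reliability, and FADE-CLIP, which injects structural priors from frozen vision foundation models through proxy attention. Fusing the two at the logit level improves zero-shot segmentation without additional training.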

Findings

1. Outperforms prior training-free methods across eight benchmarks.
2. Scales favorably with different CLIP backbones.
3. Requires no retraining or additional learnable parameters.

Abstract

Open-vocabulary semantic segmentation requires assigning pixel-level semantic labels while supporting an open and unrestricted set of categories. Training-free CLIP-based approaches preserve strong zero-shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves patch-level reliability via lightweight, inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure-aware patch interactions to jointly influence final predictions, with optional instance-aware…
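
To make the idea of logit-level fusion concrete, here is a minimal sketch in plain Python. The abstract says only that the two branches are fused at the logit level; the convex combination, the `alpha` weight, the function names, and the toy numbers below are illustrative assumptions and are not taken from the paper.

```python
from typing import List

Logits = List[List[float]]  # shape: [num_patches][num_classes]


def fuse_logits(branch_a: Logits, branch_b: Logits, alpha: float = 0.5) -> Logits:
    """Convex combination of two per-patch logit maps.

    ASSUMPTION: the weighted sum and `alpha` are hypothetical; the source
    does not specify the fusion rule beyond "logit level".
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(branch_a) != len(branch_b):
        raise ValueError("branches must cover the same patches")
    fused = []
    for row_a, row_b in zip(branch_a, branch_b):
        if len(row_a) != len(row_b):
            raise ValueError("branches must score the same classes")
        fused.append([alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)])
    return fused


def predict_labels(logits: Logits, class_names: List[str]) -> List[str]:
    """Assign each patch the class with the highest fused logit."""
    labels = []
    for row in logits:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(class_names[best])
    return labels


# Toy example: 3 patches, 2 candidate categories (made-up numbers).
class_names = ["cat", "grass"]
og_logits = [[2.0, 1.0], [0.2, 0.1], [0.0, 3.0]]    # stand-in for the OG-CLIP branch
fade_logits = [[1.0, 0.0], [0.0, 2.0], [0.5, 2.5]]  # stand-in for the FADE-CLIP branch

fused = fuse_logits(og_logits, fade_logits, alpha=0.5)
labels = predict_labels(fused, class_names)
print(labels)
```

In this toy input the second patch is a weak "cat" according to the first branch alone, but the second branch's stronger "grass" score dominates after fusion. That is the kind of interaction a logit-level combination allows, whatever the exact rule used in DouC.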
