DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation
Mohamad Zamini, Diksha Shukla

TL;DR
DouC is a training-free dual-branch CLIP framework for open-vocabulary segmentation that enhances local token reliability and spatial coherence without retraining.
Contribution
DouC introduces a dual-branch approach that combines inference-time token gating (OG-CLIP) with structural priors from frozen vision foundation models (FADE-CLIP), improving zero-shot segmentation without additional training.
Findings
Outperforms prior training-free methods across eight benchmarks.
Scales favorably with different CLIP backbones.
Requires no retraining or additional learnable parameters.
Abstract
Open-vocabulary semantic segmentation requires assigning pixel-level semantic labels while supporting an open and unrestricted set of categories. Training-free CLIP-based approaches preserve strong zero-shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves patch-level reliability via lightweight, inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure-aware patch interactions to jointly influence final predictions, with optional instance-aware…
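The logit-level fusion described in the abstract can be illustrated with a small sketch. The abstract excerpt does not give the fusion rule, so the convex combination below, the weight `alpha`, and the function names `fuse_logits` and `predict_labels` are all hypothetical choices made for illustration, not the authors' implementation. Each branch is assumed to have already produced one row of class scores per image patch.

```python
from typing import List

Logits = List[List[float]]  # one row of class scores per patch


def fuse_logits(branch_a: Logits, branch_b: Logits, alpha: float = 0.5) -> Logits:
    """Blend two branches' per-patch class logits.

    ASSUMPTION: a convex combination weighted by `alpha`. The source only
    says the branches are fused at the logit level.
    """
    if len(branch_a) != len(branch_b):
        raise ValueError("branches must cover the same number of patches")
    fused = []
    for row_a, row_b in zip(branch_a, branch_b):
        if len(row_a) != len(row_b):
            raise ValueError("branches must score the same number of classes")
        fused.append([alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)])
    return fused


def predict_labels(fused: Logits, grid_width: int) -> List[List[int]]:
    """Pick the highest-scoring class per patch and reshape to the patch grid."""
    flat = [max(range(len(row)), key=row.__getitem__) for row in fused]
    return [flat[i:i + grid_width] for i in range(0, len(flat), grid_width)]


# Toy input: a 2x2 patch grid scored against 3 candidate class names.
og_branch = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0], [1.0, 0.9, 0.0]]
fade_branch = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0], [0.0, 2.0, 0.0]]

labels = predict_labels(fuse_logits(og_branch, fade_branch), grid_width=2)
print(labels)  # [[0, 1], [2, 1]]
```

In the toy input, the last patch is narrowly assigned class 0 by the first branch alone, but the second branch's stronger evidence for class 1 decides the fused prediction. This is the sense in which both branches can jointly influence the final label.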
