DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation

Zhiwei Yang; Pengfei Song; Yucong Meng; Kexue Fu; Shuo Wang; Zhijian Song

arXiv:2605.04593 · cs.CV · May 7, 2026

TL;DR

DiCLIP is a weakly supervised semantic segmentation framework that uses a generative diffusion model to enhance CLIP's dense knowledge in both the visual and text modalities, improving localization accuracy and reducing training costs.

Contribution

The paper proposes two modules, Visual Correlation Enhancement (VCE) and Text Semantic Augmentation (TSA), that use a diffusion model to enhance CLIP's visual and textual features for dense prediction.

Findings

01. Outperforms state-of-the-art methods on PASCAL VOC and MS COCO

02. Reduces training costs significantly

03. Enhances dense prediction accuracy

Abstract

Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically leverages Class Activation Maps (CAMs) to achieve pixel-level predictions. Recently, Contrastive Language-Image Pre-training (CLIP) has been introduced to generate CAMs in WSSS. However, previous WSSS methods solely adopt CLIP's vision-language paired property for dense localization, neglecting its inherently limited dense knowledge across both visual and text modalities, which renders CAM generation suboptimal. In this work, we propose DiCLIP, a novel WSSS framework that leverages the generative diffusion model to enhance CLIP's dense knowledge across two modalities. Specifically, Visual Correlation Enhancement (VCE) and Text Semantic Augmentation (TSA) modules are proposed for dense prediction enhancement. To improve the spatial awareness of visual features, our VCE module utilizes diffusion's reliable…
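
For readers unfamiliar with the setup, the CLIP-style dense localization that the abstract refers to can be illustrated as a cosine similarity between per-patch image features and per-class text embeddings, reshaped into one coarse map per class. The following is a minimal sketch under stated assumptions, not DiCLIP's method: the VCE and TSA modules are not described in enough detail here to reproduce, and the function names, the min-max normalization, and the toy feature vectors are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dense_similarity_maps(patch_feats, text_embeds, grid_hw):
    """Build one coarse localization map per class.

    patch_feats: list of N patch feature vectors (assumed to come from an
                 image encoder; here they are just lists of floats).
    text_embeds: list of C class text embeddings of the same dimension.
    grid_hw:     (H, W) patch grid with H * W == N.
    Returns a list of C maps, each an H x W nested list scaled to [0, 1].
    """
    h, w = grid_hw
    if len(patch_feats) != h * w:
        raise ValueError("grid_hw must match the number of patches")

    patches = [l2_normalize(p) for p in patch_feats]
    maps = []
    for text in text_embeds:
        t = l2_normalize(text)
        # Cosine similarity between this class embedding and every patch.
        sims = [sum(a * b for a, b in zip(t, p)) for p in patches]
        # Min-max normalization per class (an illustrative choice).
        lo, hi = min(sims), max(sims)
        scale = (hi - lo) or 1.0
        flat = [(s - lo) / scale for s in sims]
        maps.append([flat[r * w:(r + 1) * w] for r in range(h)])
    return maps


# Toy example: a 2x2 patch grid and two classes with 4-dimensional features.
toy_patches = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
toy_texts = [[1, 0, 0, 0], [0, 1, 0, 0]]
toy_maps = dense_similarity_maps(toy_patches, toy_texts, (2, 2))
```

In this toy case each class map peaks at the single patch whose feature matches the class embedding. Real patch features are far noisier, which is the kind of limited dense knowledge the abstract says DiCLIP targets.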

Code & Models

Repositories

zwyang6/DiCLIP (GitHub)