DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation

Anh M. Vu (equal contribution); Khang P. Le (equal contribution); Trang T. K. Vo (equal contribution); Ha Thach; Huy Hung Nguyen; David Yang; Han H. Huynh; Quynh Nguyen; Tuan M. Pham; Tuan-Anh Le; Minh H. N. Le; Thanh-Huy Nguyen; Akash Awasthi; Chandra Mohan; Zhu Han; Hien Van Nguyen

arXiv:2512.10314·cs.CV·December 12, 2025

TL;DR

DualProtoSeg is a prototype-driven framework for weakly supervised histopathology image segmentation that combines text and image cues with multi-scale features, and it surpasses prior state-of-the-art methods on the BCSS-WSSS benchmark.

Contribution

The paper presents a dual-modal prototype bank that pairs text prototypes with image prototypes, and adds a multi-scale pyramid module to improve localization in weakly supervised segmentation.

Findings

01. Outperforms existing state-of-the-art methods on the BCSS-WSSS benchmark.

02. Greater diversity in the text descriptions and longer context length improve segmentation quality.

03. Combining textual and visual prototypes enhances region discovery.

Abstract

Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing…
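
To make the idea of a dual-modal prototype bank concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the class names ("tumor", "stroma"), the 2-D toy vectors, and the scoring rule (maximum cosine similarity over a class's prototypes, then an argmax per patch) are all assumptions chosen for illustration. In the method described above, the text prototypes come from learnable prompts and the image prototypes are learned; here both are fixed vectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def build_bank(text_protos, image_protos):
    """Merge per-class text and image prototypes into one bank.

    Both arguments map a class name to a list of prototype vectors.
    """
    classes = sorted(set(text_protos) | set(image_protos))
    return {c: text_protos.get(c, []) + image_protos.get(c, []) for c in classes}


def class_scores(patch, bank):
    """Score a patch per class as the best match among that class's prototypes.

    Max-over-prototypes is an assumed aggregation rule, not taken from the paper.
    """
    return {c: max(cosine(patch, p) for p in protos) for c, protos in bank.items()}


def segment(patches, bank):
    """Assign each patch feature the class with the highest score."""
    labels = []
    for patch in patches:
        scores = class_scores(patch, bank)
        labels.append(max(scores, key=scores.get))
    return labels


# Toy data: hypothetical 2-D embeddings standing in for real features.
text_protos = {"tumor": [[1.0, 0.0]], "stroma": [[0.0, 1.0]]}
image_protos = {"tumor": [[0.9, 0.1]], "stroma": [[0.2, 0.8]]}
bank = build_bank(text_protos, image_protos)

patches = [[0.8, 0.2], [0.1, 0.9]]
labels = segment(patches, bank)
print(labels)
```

In this toy setup each class has two prototypes, one from each modality, so a patch can match a class through either its semantic (text) or its appearance (image) cue.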


Taxonomy

Topics: AI in cancer detection · Domain Adaptation and Few-Shot Learning · Face recognition and analysis