Dual Progressive Transformations for Weakly Supervised Semantic Segmentation
Dongjian Huo, Yukun Su, Qingyao Wu

TL;DR
This paper introduces CRT, a novel CNN-refined transformer approach that improves weakly supervised semantic segmentation by generating more accurate class activation maps, achieving state-of-the-art results on benchmark datasets.
Contribution
The paper proposes a CNN-refined transformer (CRT) that effectively combines global attention and local accuracy for WSSS, addressing over-activation issues of pure transformers.
Findings
CRT achieves new state-of-the-art performance on PASCAL VOC 2012.
CRT significantly outperforms previous methods in weakly supervised object localization.
Extensive experiments validate the effectiveness of the proposed approach.
Abstract
Weakly supervised semantic segmentation (WSSS), which aims to mine object regions using only class-level labels, is a challenging task in computer vision. Current state-of-the-art CNN-based methods usually adopt Class Activation Maps (CAMs) to highlight the potential areas of the object; however, they may suffer from the part-activation issue. To this end, we make an early attempt to explore the global feature attention mechanism of the vision transformer in the WSSS task. However, since the transformer lacks the inductive bias of CNN models, it cannot boost the performance directly and may yield over-activation problems. To tackle these drawbacks, in this paper we propose a Convolutional Neural Networks Refined Transformer (CRT) to mine globally complete and locally accurate class activation maps. To validate the effectiveness of our proposed method, extensive experiments…
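For readers unfamiliar with CAMs, the sketch below shows the classic formulation (Zhou et al., 2016) that CNN-based WSSS methods build on. It is not CRT's refinement module, which the abstract does not describe. It assumes a backbone that produces `(C, H, W)` feature maps followed by global average pooling and a linear classifier. The function names `compute_cam` and `cam_to_mask`, and the 0.3 foreground threshold, are hypothetical choices for illustration only.

```python
import numpy as np


def compute_cam(features: np.ndarray, fc_weights: np.ndarray, class_idx: int) -> np.ndarray:
    """Classic CAM: weight each feature channel by the classifier weight of one
    class, sum over channels, keep positive evidence, and normalize to [0, 1].

    features:   (C, H, W) feature maps from the last backbone stage (assumed)
    fc_weights: (num_classes, C) weights of the linear classifier after GAP (assumed)
    """
    # Contract the channel axis: (C,) x (C, H, W) -> (H, W)
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    cam = np.maximum(cam, 0.0)  # ReLU: drop regions voting against the class
    peak = cam.max()
    # Normalize by the peak so maps from different images are comparable
    return cam / peak if peak > 0 else cam


def cam_to_mask(cam: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """Binarize a normalized CAM into a foreground mask.
    The default threshold is a hypothetical value, not taken from the paper."""
    return cam >= threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.random((8, 4, 4))    # toy features: 8 channels, 4x4 spatial grid
    weights = rng.normal(size=(3, 8))  # toy classifier for 3 classes
    cam = compute_cam(feats, weights, class_idx=1)
    print(cam.round(2))
    print(cam_to_mask(cam).astype(int))
```

A map computed this way tends to peak on the most discriminative object part, which is the part-activation issue the abstract attributes to CNN-based methods. CRT's stated goal is to produce maps that are both globally complete and locally accurate instead.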
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Label Smoothing · Multi-Head Attention · Adam · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Layer Normalization
