Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation
Joelle Hanna, Damian Borth

TL;DR
This paper introduces an end-to-end method leveraging Vision Transformer attention maps for weakly supervised semantic segmentation, improving pseudo-mask quality and reducing reliance on detailed annotations.
Contribution
It proposes training a sparse ViT with multiple class-specific [CLS] tokens and a masking strategy to generate accurate pseudo-masks directly from attention maps.
Findings
Outperforms existing methods on two standard benchmarks
Generates pseudo-masks that train a segmentation model to results comparable to fully-supervised models
Reduces need for detailed pixel-level annotations
Abstract
Weakly Supervised Semantic Segmentation (WSSS) is a challenging problem that has been extensively studied in recent years. Traditional approaches often rely on external modules like Class Activation Maps to highlight regions of interest and generate pseudo segmentation masks. In this work, we propose an end-to-end method that directly utilizes the attention maps learned by a Vision Transformer (ViT) for WSSS. We propose training a sparse ViT with multiple [CLS] tokens (one for each class), using a random masking strategy to promote [CLS] token - class assignment. At inference time, we aggregate the different self-attention maps of each [CLS] token corresponding to the predicted labels to generate pseudo segmentation masks. Our proposed approach enhances the interpretability of self-attention maps and ensures accurate class assignments. Extensive experiments on two standard benchmarks…
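The two steps the abstract describes (random masking of class-specific [CLS] tokens during training, and aggregating the attention maps of the predicted classes' [CLS] tokens at inference) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the rule that a masked token's label is set to absent, head averaging, min-max normalization, and the background threshold are all assumptions made for illustration.

```python
import numpy as np


def mask_class_tokens(labels, rng, mask_prob=0.5):
    """Pick a random subset of class tokens to mask for one training sample.

    labels: (C,) multi-hot image-level labels.
    Returns (keep, targets). keep[c] is False when token c is masked.
    Assumption: the target of every masked class is set to 0, so class c
    can only be predicted through its own [CLS] token.
    """
    keep = rng.random(labels.shape[0]) >= mask_prob
    targets = labels * keep
    return keep, targets


def pseudo_mask(attn, predicted, grid_hw, bg_threshold=0.3):
    """Build a pseudo segmentation mask from class-token attention.

    attn: (heads, C + N, C + N) self-attention of one layer, with the C
          class tokens first and the N = H * W patch tokens after them.
    predicted: list of predicted class indices.
    Returns an (H, W) integer map: 0 is background, c + 1 is class c.
    """
    h, w = grid_hw
    if len(predicted) == 0:
        return np.zeros((h, w), dtype=int)
    num_cls = attn.shape[1] - h * w
    maps = []
    for c in predicted:
        # Attention from class token c to every patch, averaged over heads.
        m = attn[:, c, num_cls:].mean(axis=0).reshape(h, w)
        # Min-max normalize so maps of different classes are comparable
        # (an assumed choice, not taken from the paper).
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)
        maps.append(m)
    maps = np.stack(maps)  # (P, H, W), P = number of predicted classes
    best = maps.argmax(axis=0)
    mask = np.asarray(predicted)[best] + 1
    # Patches that no predicted class attends to strongly become background.
    mask[maps.max(axis=0) < bg_threshold] = 0
    return mask
```

In this sketch, `keep` would be used to drop the masked tokens from the ViT input sequence before the forward pass, and `attn` would come from one or more of the model's self-attention layers. Which layers and heads to aggregate is a design choice that the truncated abstract does not specify.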
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
