TAG: Guidance-free Open-Vocabulary Semantic Segmentation
Yasufumi Kawano, Yoshimitsu Aoki

TL;DR
The paper introduces TAG, a guidance-free open-vocabulary semantic segmentation method that uses pre-trained models to segment images without additional training or dense annotations. It reports state-of-the-art results on PascalContext and ADE20K and a 15.3-point mIoU gain on PascalVOC.
Contribution
TAG is the first approach to perform open-vocabulary segmentation without guidance or training, using pre-trained models and external class label retrieval.
Findings
Improves mIoU by 15.3 points on PascalVOC
State-of-the-art results on PascalContext and ADE20K
Operates without class name guidance or additional training
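For readers unfamiliar with the metric behind these findings, mIoU (mean intersection-over-union) averages, over classes, the overlap between predicted and ground-truth pixels divided by their union. The snippet below is a minimal sketch on a flattened toy label map, not the paper's evaluation code; the function name `mean_iou` and the toy data are assumptions made for illustration, and benchmark implementations typically accumulate counts over a whole dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are equal-length sequences of integer class ids,
    e.g. a flattened segmentation map (illustrative helper only).
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. A "15.3-point" gain refers to this quantity expressed as a percentage.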
Abstract
Semantic segmentation is a crucial task in computer vision, where each pixel in an image is classified into a category. However, traditional methods face significant challenges, including the need for pixel-level annotations and extensive training. Furthermore, because supervised learning uses a limited set of predefined categories, models typically struggle with rare classes and cannot recognize new ones. Unsupervised and open-vocabulary segmentation, proposed to tackle these issues, faces challenges, including the inability to assign specific class labels to clusters and the necessity of user-provided text queries for guidance. In this context, we propose a novel approach, TAG, which achieves Training, Annotation, and Guidance-free open-vocabulary semantic segmentation. TAG utilizes pre-trained models such as CLIP and DINO to segment images into meaningful categories without additional…
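The abstract is truncated here and does not spell out how class labels are retrieved. As a rough illustration of the general idea of external class label retrieval, the sketch below assumes that each image segment has an embedding and that candidate labels live in an external database of embeddings in the same space; it then picks the label with the highest cosine similarity. Every name (`cosine`, `retrieve_label`, `database`) and every vector is hypothetical, hand-made toy data standing in for real model features. This is not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_label(segment_embedding, database):
    """Return the database label whose embedding is closest to the segment.

    database maps label strings to embedding vectors (hypothetical layout).
    """
    return max(database, key=lambda label: cosine(segment_embedding, database[label]))


# Toy 3-dimensional embeddings; real features would be far larger.
database = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "sky": [0.0, 0.1, 0.9],
}
segment = [0.8, 0.2, 0.1]
print(retrieve_label(segment, database))
```

Because the label comes from the database rather than from a user-supplied query, a pipeline of this shape needs no text prompts at inference time, which is the sense of "guidance-free" in the title.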
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Dense Connections · Softmax · Layer Normalization · Multi-Head Attention · Residual Connection · Vision Transformer · self-DIstillation with NO labels
