Applying ViT in Generalized Few-shot Semantic Segmentation
Liyuan Geng, Jinhong Xia, Yuanhe Guo

TL;DR
This paper evaluates the effectiveness of Vision Transformer (ViT) models in generalized few-shot semantic segmentation, showing that pretrained ViT models significantly outperform ResNet-based models on benchmarks.
Contribution
It demonstrates the superior performance of ViT-based models, especially DINOv2 with a linear classifier, in GFSS tasks compared to traditional ResNet models.
Findings
ViT models outperform ResNet in GFSS benchmarks.
DINOv2 with a linear classifier outperforms the best ResNet-based structure by 116% in the one-shot setting.
Large ViT models are prone to overfitting in GFSS applications.
Abstract
This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework. We conduct experiments with various combinations of backbone models, including ResNets and pretrained Vision Transformer (ViT)-based models, and decoders, including a linear classifier, UPerNet, and Mask Transformer. The combination of DINOv2 and a linear classifier takes the lead on the popular few-shot segmentation benchmark PASCAL-5i, substantially outperforming the best ResNet-based structure by 116% in the one-shot scenario. We demonstrate the great potential of large pretrained ViT-based models on the GFSS task and expect further improvement on testing benchmarks. A potential caveat is that combining a pure ViT-based backbone with a large-scale ViT decoder makes the model prone to overfitting.
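As a rough illustration of what a linear-classifier decoder on top of a ViT backbone does, the sketch below scores each patch token against each class, takes the per-patch argmax, and upsamples the patch grid to pixel resolution. It is a minimal toy in plain Python, not the paper's implementation: the function names, feature dimension, patch size, and the toy tokens and weights are all assumptions made for illustration, and the backbone itself (e.g. DINOv2) is replaced by hand-written feature vectors.

```python
# Hypothetical sketch of a per-patch linear classifier over ViT patch features.
# All names, shapes, and values are illustrative assumptions.

def linear_head(patch_tokens, weights, bias):
    """Compute class logits for every patch token: logits = W @ token + b."""
    logits = []
    for token in patch_tokens:
        row = [sum(w * x for w, x in zip(w_c, token)) + b_c
               for w_c, b_c in zip(weights, bias)]
        logits.append(row)
    return logits

def to_label_grid(logits, grid_h, grid_w):
    """Take the argmax class per patch and reshape the flat sequence to a 2D grid."""
    labels = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return [labels[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]

def upsample_nearest(grid, patch_size):
    """Nearest-neighbour upsampling from patch resolution to pixel resolution."""
    out = []
    for row in grid:
        wide = [label for label in row for _ in range(patch_size)]
        out.extend([list(wide) for _ in range(patch_size)])
    return out

# Toy input: a 2x2 patch grid with 3-dimensional features and 2 classes.
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
W = [[1.0, 0.0, 0.0],   # class 0 weight row
     [0.0, 1.0, 0.0]]   # class 1 weight row
b = [0.0, 0.0]

grid = to_label_grid(linear_head(tokens, W, b), grid_h=2, grid_w=2)
mask = upsample_nearest(grid, patch_size=2)
for row in mask:
    print(row)
```

In a generalized few-shot setting the weight matrix would hold one row per base class and per novel class, so that a single head predicts over both; how those rows are trained is outside the scope of this sketch.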
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Attention Is All You Need · Average Pooling · Linear Layer · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Global Average Pooling
