Applying ViT in Generalized Few-shot Semantic Segmentation

Liyuan Geng; Jinhong Xia; Yuanhe Guo

arXiv:2408.14957·cs.CV·August 28, 2024

TL;DR

This paper evaluates Vision Transformer (ViT) models for generalized few-shot semantic segmentation (GFSS) and shows that pretrained ViT backbones significantly outperform ResNet-based models on the PASCAL-$5^i$ benchmark.

Contribution

The paper demonstrates that ViT-based models, especially DINOv2 paired with a linear classifier, outperform traditional ResNet models on GFSS tasks.

Findings

1. ViT models outperform ResNet models on GFSS benchmarks.

2. DINOv2 with a linear classifier outperforms the best ResNet structure by 116% in the one-shot scenario on PASCAL-$5^i$.

3. A pure ViT-based model combined with a large-scale ViT decoder is prone to overfitting.
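
To make the 116% figure concrete, the short sketch below reads it as a relative gain over the baseline score. This reading is an interpretation: an absolute gain of 116 points is not possible for a metric on a 0 to 100 scale such as mIoU. The function name `relative_improvement` and both score values are hypothetical, chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative gain of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Hypothetical scores (NOT taken from the paper), picked so that the
# relative gain comes out at 116%.
baseline_score = 25.0
new_score = 54.0

gain = relative_improvement(baseline_score, new_score)
print(f"{gain:.0f}% relative improvement")
```

Under this reading, a 116% improvement means the new score is 2.16 times the baseline, that is, slightly more than double.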

Abstract

This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework. We conduct experiments with various combinations of backbones, including ResNets and pretrained Vision Transformer (ViT)-based models, and decoders, including a linear classifier, UPerNet, and Mask Transformer. The combination of DINOv2 and a linear classifier takes the lead on the popular few-shot segmentation benchmark PASCAL-$5^i$, outperforming the best ResNet-based structure by 116% in the one-shot scenario. We demonstrate the great potential of large pretrained ViT-based models on the GFSS task and expect further improvement on testing benchmarks. A potential caveat, however, is that a pure ViT-based model combined with a large-scale ViT decoder overfits easily.
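
As an illustration of what a linear classifier decoder on top of a ViT backbone can look like, the sketch below scores each patch token against each class with a single linear map, takes the argmax per patch, and expands the patch grid to pixel resolution. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not describe the decoder in detail, the function names `linear_patch_classifier` and `patch_logits_to_mask` are hypothetical, the toy features and weights are made up, and nearest-neighbour upsampling is a simplification used here to keep the example in plain Python.

```python
from typing import List

Matrix = List[List[float]]


def linear_patch_classifier(tokens: Matrix, weights: Matrix, bias: List[float]) -> Matrix:
    """Compute per-patch class logits: logits[p][c] = weights[c] . tokens[p] + bias[c]."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weights, bias)]
        for tok in tokens
    ]


def patch_logits_to_mask(logits: Matrix, grid_h: int, grid_w: int, patch: int) -> List[List[int]]:
    """Argmax per patch, then nearest-neighbour upsampling to pixel resolution."""
    labels = [max(range(len(l)), key=l.__getitem__) for l in logits]
    mask: List[List[int]] = []
    for gy in range(grid_h):
        row: List[int] = []
        for gx in range(grid_w):
            # Repeat each patch label `patch` times horizontally.
            row.extend([labels[gy * grid_w + gx]] * patch)
        # Repeat the expanded row `patch` times vertically.
        mask.extend([list(row) for _ in range(patch)])
    return mask


# Toy input (assumed values): a 2x2 patch grid, 3-dim features, 2 classes.
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
weights = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]]  # one weight row per class
bias = [0.0, 0.0]

logits = linear_patch_classifier(tokens, weights, bias)
mask = patch_logits_to_mask(logits, grid_h=2, grid_w=2, patch=2)
for line in mask:
    print(line)
```

In this setup the decoder adds only one weight row and one bias per class, which is far fewer parameters than a multi-layer decoder such as UPerNet or Mask Transformer.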

Code & Models

Repositories

lgnyu/vitseg (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Attention Is All You Need · Average Pooling · Linear Layer · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Global Average Pooling