A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models
Reda Bensaid, Vincent Gripon, François Leduc-Primeau, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux

TL;DR
This paper introduces a new benchmark for evaluating how well foundation vision models can be adapted for few-shot semantic segmentation, revealing the importance of feature extractors and adaptation methods.
Contribution
It presents the first study on adapting vision foundation models for few-shot semantic segmentation and proposes a realistic benchmark for this task.
Findings
Self-supervised models can outperform segmentation-specific models.
Parameter-efficient fine-tuning yields competitive results.
The feature extractor plays a critical role in adaptation performance.
Abstract
Few-shot semantic segmentation (FSS) is a crucial challenge in computer vision, driving extensive research into a diverse range of methods, from advanced meta-learning techniques to simple transfer learning baselines. With the emergence of vision foundation models (VFMs) serving as generalist feature extractors, we seek to explore the adaptation of these models for FSS. While current FSS benchmarks focus on adapting pre-trained models to new tasks with few images, they emphasize in-domain generalization, making them less suitable for VFMs trained on large-scale web datasets. To address this, we propose a novel realistic benchmark with a simple and straightforward adaptation process tailored for this task. Using this benchmark, we conduct a comprehensive comparative analysis of prominent VFMs and semantic segmentation models. To evaluate their effectiveness, we leverage various adaption…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Softmax · Layer Normalization · Residual Connection · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels · Contrastive Language-Image Pre-training
