Self-Supervised Open-Ended Classification with Small Visual Language Models
Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G. M. Snoek, Marcel Worring, Yuki M. Asano

TL;DR
SeCAt is a self-supervised method that enhances small visual language models' few-shot classification abilities by using clustering and pseudo-captioning, outperforming larger models on various datasets.
Contribution
Introduces SeCAt, a novel self-supervised approach that improves open-ended few-shot classification in small visual language models using clustering and pseudo-captioning techniques.
Findings
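With models of approximately 1B parameters, SeCAt outperforms the few-shot abilities of much larger models, such as Frozen and FROMAGe, on several multimodal few-shot datasets.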
The method enables small models to achieve competitive few-shot classification performance.
SeCAt facilitates open-ended learning without large or proprietary models.
Abstract
We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models. Our approach imitates image captions in a self-supervised way based on clustering a large pool of images followed by assigning semantically unrelated names to clusters. By doing so, we construct a training signal consisting of interleaved sequences of image and pseudo-caption pairs and a query image, which we denote as the 'self-context' sequence. Based on this signal, the model is trained to produce the right pseudo-caption. We demonstrate the performance and flexibility of SeCAt on several multimodal few-shot datasets, spanning various granularities. By using models with approximately 1B parameters, we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe. SeCAt opens new possibilities for…
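The construction of a 'self-context' sequence described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy k-means, the nonsense-word vocabulary, the function names, and the `ways`/`shots` parameters are all assumptions made for illustration, and images are represented only by integer indices into a feature list.

```python
import random


def kmeans(features, k, iters=10, seed=0):
    """Toy k-means (an assumed stand-in for the clustering step).

    Returns one cluster id per feature vector.
    """
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(features, k)]
    assign = [0] * len(features)
    for _ in range(iters):
        # Assign every feature to its nearest centroid (squared Euclidean).
        for i, f in enumerate(features):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])),
            )
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [features[i] for i, a in enumerate(assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def name_clusters(num_clusters, vocabulary, seed=0):
    """Give each cluster a distinct, arbitrary name (hypothetical vocabulary)."""
    return random.Random(seed).sample(vocabulary, num_clusters)


def build_self_context(assign, names, ways=2, shots=1, seed=0):
    """Build one sequence: (image, pseudo-caption) pairs plus a query image.

    Returns (support, query_index, target_caption), where the target is the
    pseudo-caption the model would be trained to produce for the query.
    """
    rng = random.Random(seed)
    by_cluster = {}
    for idx, c in enumerate(assign):
        by_cluster.setdefault(c, []).append(idx)
    # Only clusters with enough images can supply support and query examples.
    eligible = [c for c, m in by_cluster.items() if len(m) >= shots + 1]
    chosen = rng.sample(eligible, ways)
    target = rng.choice(chosen)
    support, query = [], None
    for c in chosen:
        picked = rng.sample(by_cluster[c], shots + 1 if c == target else shots)
        if c == target:
            query = picked.pop()  # held-out image from the target cluster
        support += [(i, names[c]) for i in picked]
    rng.shuffle(support)  # interleave pairs from different clusters
    return support, query, names[target]
```

In this sketch, the cluster ids from `kmeans` and the names from `name_clusters` would be passed to `build_self_context`, and each returned triple would form one training example. The real method operates on a large image pool and a visual language model, neither of which is modeled here.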
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
