SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
Yi-Syuan Chen, Yun-Zhu Song, Cheng Yu Yeo, Bei Liu, Jianlong Fu, Hong-Han Shuai

TL;DR
SINC is a self-supervised framework that enables in-context learning for vision-language tasks without relying on large language models' intrinsic abilities, reducing resource demands and improving few-shot performance.
Contribution
It introduces a meta-model trained on self-supervised prompts to facilitate in-context predictions, offering a resource-efficient alternative to large language model-based methods.
Findings
SINC outperforms gradient-based methods in few-shot vision-language tasks.
Analysis of the framework reveals components that are key to the emergence of in-context learning.
SINC requires fewer computational resources than approaches built on large language models.
Abstract
Large Pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in the inputs. Recent works promote this ability in the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods could inherit issues in the language domain, such as template sensitivity and hallucination. Also, the scale of these language models raises a significant demand for computations, making learning and operating these models resource-intensive. To this end, we raise a question: "How can we enable in-context learning without relying on the intrinsic in-context ability of large language models?" To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning…
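To make "constructing a new predictor from demonstrations without gradient updates" concrete, here is a minimal toy sketch in Python. It is not the SINC meta-model: the `Demonstration` type, the `build_prompt` and `predict_in_context` helpers, and the nearest-neighbour rule standing in for a frozen model are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Demonstration:
    """One in-context example: a hypothetical feature vector and its label."""
    features: Sequence[float]
    label: str


def build_prompt(demos: List[Demonstration], query: Sequence[float]) -> list:
    """Pack the demonstrations and the query into one input sequence.

    The final element carries the label None, marking the slot to predict.
    """
    prompt = [(tuple(d.features), d.label) for d in demos]
    prompt.append((tuple(query), None))
    return prompt


def predict_in_context(prompt: list) -> str:
    """Toy stand-in for a frozen model (assumption, not the paper's method).

    Returns the label of the demonstration closest to the query. No parameter
    is updated; the predictor is defined entirely by the prompt contents.
    Requires at least one demonstration in the prompt.
    """
    *demos, (query, _) = prompt

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(demos, key=lambda d: sq_dist(d[0], query))
    return label
```

Swapping in a different set of demonstrations changes the resulting predictor immediately, with no training step, which is the property the abstract describes.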
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
