SINC: Self-Supervised In-Context Learning for Vision-Language Tasks

Yi-Syuan Chen; Yun-Zhu Song; Cheng Yu Yeo; Bei Liu; Jianlong Fu; Hong-Han Shuai

arXiv:2307.07742·cs.CV·August 22, 2023·1 cites

TL;DR

SINC is a self-supervised framework that enables in-context learning for vision-language tasks without relying on large language models' intrinsic abilities, reducing resource demands and improving few-shot performance.

Contribution

It introduces a meta-model trained on self-supervised prompts to facilitate in-context predictions, offering a resource-efficient alternative to large language model-based methods.

Findings

1. SINC outperforms gradient-based methods in few-shot vision-language tasks.
2. The framework reveals key components for in-context learning emergence.
3. SINC reduces computational resource requirements.

Abstract

Large pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in the inputs. Recent works promote this ability in the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods could inherit issues from the language domain, such as template sensitivity and hallucination. Also, the scale of these language models places a significant demand on computation, making it resource-intensive to train and operate them. To this end, we raise a question: "How can we enable in-context learning without relying on the intrinsic in-context ability of large language models?" To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling