In Context Learning with Vision Transformers: Case Study

Antony Zhao; Alex Proshkin; Fergal Hennessy; Francesco Crivelli

arXiv:2505.20872 · cs.CV · May 28, 2025

Open Access

TL;DR

This paper investigates whether large vision transformer models can perform in-context learning of complex functions on images, extending prior work on random data to the image domain.

Contribution

The paper explores whether vision transformers can in-context learn complex functions on images, such as those computed by CNNs, a setting that has not been extensively studied.

Findings

01 Transformers can learn linear functions in image space

02 Potential to learn neural network functions in images

03 Extends in-context learning analysis to complex visual tasks

Abstract

Large transformer models have been shown to be capable of performing in-context learning. By using examples in a prompt as well as a query, they are capable of performing tasks such as few-shot, one-shot, or zero-shot learning to output the corresponding answer to this query. One area of interest to us is that these transformer models have been shown to be capable of learning the general class of certain functions, such as linear functions and small 2-layer neural networks, on random data (Garg et al., 2023). We aim to extend this to the image space to analyze their capability to in-context learn more complex functions on the image space, such as convolutional neural networks and other methods.
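
The prompt format described in the abstract (example input/output pairs followed by a query) can be illustrated with a minimal sketch. It assumes the simplest function class mentioned, a random linear function, applied to flattened images. All names here (`sample_linear_function`, `sample_image`, `build_prompt`) are hypothetical, and random pixel vectors stand in for real images. The paper's actual data pipeline and model are not described on this page.

```python
import random


def sample_linear_function(dim, rng):
    """Draw a random weight vector w; the task function is f(x) = <w, x>."""
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    return f


def sample_image(height, width, rng):
    """Assumed stand-in for an image: flattened random pixels in [0, 1)."""
    return [rng.random() for _ in range(height * width)]


def build_prompt(num_examples, height, width, seed=0):
    """Build one in-context task: (x_i, f(x_i)) pairs, a query, and its target."""
    rng = random.Random(seed)
    f = sample_linear_function(height * width, rng)
    examples = []
    for _ in range(num_examples):
        x = sample_image(height, width, rng)
        examples.append((x, f(x)))
    query = sample_image(height, width, rng)
    # A model would see the examples and the query, and be scored against target.
    return examples, query, f(query)


examples, query, target = build_prompt(num_examples=8, height=4, width=4)
print(len(examples), len(query))  # prints "8 16"
```

A fresh function is drawn for each prompt, so a model can only predict the target by inferring the function from the in-context examples rather than by memorizing one fixed mapping.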

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Currency Recognition and Detection · Video Surveillance and Tracking Methods