In Context Learning with Vision Transformers: Case Study
Antony Zhao, Alex Proshkin, Fergal Hennessy, Francesco Crivelli

TL;DR
This paper investigates the ability of large vision transformer models to perform in-context learning on complex image functions, extending prior work from simple data to the image domain.
Contribution
It explores the in-context learning capabilities of vision transformers on complex image functions such as CNNs, a setting that has not been extensively studied before.
Findings
Transformers can learn linear functions in image space
Potential to learn neural network functions in images
Extends in-context learning analysis to complex visual tasks
Abstract
Large transformer models have been shown to be capable of in-context learning. Given a prompt containing examples together with a query, they can perform few-shot, one-shot, or zero-shot learning and output the corresponding answer to the query. Of particular interest to us is that these models have been shown to learn general classes of functions, such as linear functions and small 2-layer neural networks, on random data (Garg et al., 2023). We aim to extend this to the image space and analyze their ability to in-context learn more complex functions on images, such as convolutional neural networks and other methods.
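To make the function-learning setup concrete, here is a minimal sketch of how one in-context prompt for a random linear function could be built, assuming inputs are flattened images treated as plain vectors. The helper names, the 8x8 image size, and the Gaussian sampling are illustrative assumptions, not details from the paper. The least-squares solver stands in as a reference baseline that an in-context learner could be compared against; it is not the transformer itself.

```python
import numpy as np

def make_linear_prompt(n_examples, dim, rng):
    """Sample a hidden linear function f(x) = w . x and one prompt for it.

    Returns the context pairs (x_i, f(x_i)) plus a held-out query and its answer.
    """
    w = rng.standard_normal(dim)                     # hidden task vector (assumed Gaussian)
    xs = rng.standard_normal((n_examples + 1, dim))  # last row is the query input
    ys = xs @ w                                      # noiseless targets
    return xs[:-1], ys[:-1], xs[-1], ys[-1]

def least_squares_predict(ctx_x, ctx_y, query_x):
    """Baseline: fit w from the context examples only, then answer the query."""
    w_hat, *_ = np.linalg.lstsq(ctx_x, ctx_y, rcond=None)
    return float(query_x @ w_hat)

rng = np.random.default_rng(0)
dim = 8 * 8  # a flattened 8x8 "image" (hypothetical size)
ctx_x, ctx_y, query_x, query_y = make_linear_prompt(2 * dim, dim, rng)

pred = least_squares_predict(ctx_x, ctx_y, query_x)
error = abs(pred - query_y)
print(f"baseline absolute error on the query: {error:.2e}")
```

With more context examples than input dimensions and no noise, the baseline recovers the hidden function almost exactly. A model that learns in context would be judged by how closely its query predictions approach this baseline as the number of examples grows.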
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Currency Recognition and Detection · Video Surveillance and Tracking Methods
