Context-Scaling versus Task-Scaling in In-Context Learning
Amirhesam Abedsoltan, Adityanarayanan Radhakrishnan, Jingfeng Wu, Mikhail Belkin

TL;DR
This paper investigates in-context learning (ICL) in transformers by distinguishing context-scaling from task-scaling. It shows that a simplified transformer achieves context-scaling, and that concatenating a feature map's output with the input data enables both types of scaling.
Contribution
The authors introduce a simplified transformer architecture that performs comparably to GPT-2 on ICL tasks, and they analyze how context-scaling and task-scaling can be achieved separately and together.
Findings
Simplified transformer performs comparably to GPT-2 in ICL.
A feature map can enable context-scaling but not task-scaling.
Concatenating feature map output with data enables both context- and task-scaling.
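The third finding can be pictured as a change in what the MLP receives as input. The following is a minimal sketch, assuming the prompt is flattened into a single vector and the feature map returns one scalar per prompt. The names `vectorize_prompt`, `build_mlp_input`, and the stand-in `mean_label_feature` are hypothetical and do not come from the paper; the paper's actual feature map is not specified in this summary.

```python
import numpy as np

def vectorize_prompt(X_ctx, y_ctx, x_query):
    """Flatten (x_1, y_1, ..., x_n, y_n, x_query) into one vector."""
    pairs = np.concatenate([X_ctx, y_ctx[:, None]], axis=1)  # shape (n, d + 1)
    return np.concatenate([pairs.ravel(), x_query])          # shape (n * (d + 1) + d,)

def build_mlp_input(X_ctx, y_ctx, x_query, feature_map):
    """Concatenate the feature-map output with the vectorized prompt."""
    psi = np.atleast_1d(feature_map(X_ctx, y_ctx, x_query))
    return np.concatenate([psi, vectorize_prompt(X_ctx, y_ctx, x_query)])

def mean_label_feature(X_ctx, y_ctx, x_query):
    """Placeholder feature map (an assumption): the average in-context label."""
    return float(np.mean(y_ctx))
```

In this sketch, an MLP fed only the feature-map output corresponds to the second finding, while one fed the concatenated vector corresponds to the third.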
Abstract
Transformers exhibit In-Context Learning (ICL), where these models solve new tasks by using examples in the prompt without additional training. In our work, we identify and analyze two key components of ICL: (1) context-scaling, where model performance improves as the number of in-context examples increases and (2) task-scaling, where model performance improves as the number of pre-training tasks increases. While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling. To understand how transformers are capable of context-scaling, we first propose a significantly simplified transformer architecture without key, query, value weights. We show that it performs ICL comparably to the original GPT-2 model in various statistical learning tasks including linear…
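To make context-scaling concrete, here is a minimal sketch of softmax attention with no key, query, or value weights, which reduces to a kernel smoother over the in-context examples. The function names, the `scale` parameter, the unit-sphere inputs, and the noise-free linear tasks are all assumptions chosen for illustration; this is not the authors' architecture or experimental setup.

```python
import numpy as np

def identity_attention_predict(X_ctx, y_ctx, x_query, scale=20.0):
    """Attention with identity key/query/value maps: a weighted average of labels."""
    scores = scale * (X_ctx @ x_query)   # query-key dot products, no learned weights
    scores = scores - scores.max()       # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()             # softmax over in-context examples
    return float(weights @ y_ctx)

def mean_squared_error(n_ctx, n_trials=200, dim=2, seed=0):
    """Average squared error over random linear tasks with n_ctx in-context examples."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        w = rng.normal(size=dim)                       # a fresh linear task per trial
        X = rng.normal(size=(n_ctx + 1, dim))
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # inputs on the unit sphere
        y = X @ w                                      # noise-free labels (assumption)
        pred = identity_attention_predict(X[:-1], y[:-1], X[-1])
        errors.append((pred - y[-1]) ** 2)
    return float(np.mean(errors))
```

Comparing `mean_squared_error(4)` with `mean_squared_error(1024)` gives a rough picture of context-scaling in this toy setting. Nothing here is trained, so the sketch says nothing about task-scaling.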
Peer Reviews
Decision · Submitted to ICLR 2025
- I really liked the introduction and thought that the question of context-scaling was nicely set up.
- I appreciated the authors considering a wide range of different tasks that they collected from the relevant literature.
- Building part of the SGPT into the MLP was a neat method for illustrating where the distinct mechanisms of the two architectures originate.
Unfortunately, I do not think this work is ready for publication in its current state. Primarily, I believe that the paper does not provide sufficiently novel insight beyond the prior literature. Notably, the improvement of Transformer performance with the number of in-context examples was already noted (as the authors lay out in the related work section), e.g. in Bai et al. (2023), and the prior theoretical literature also explains why this would be the case, as it draws the connection between ICL in
The SGPT considered and its experiments are novel, and it is quite surprising and interesting to see that its performance is comparable to GPT-2. I suppose this is due to the simplicity of the ICL tasks conducted in the paper. The idea of connecting ICL and kernel smoothing is clearly presented and insightful. The separation of context-scaling and task-scaling via features from the kernel estimate and vectorized input is novel and can potentially help us understand their impacts better individuall
Even given the consistency of the Hilbert estimate, how exactly the transformer performs the Hilbert estimate, i.e., via the construction of the activation function in attention, is not straightforward. What is the main intuition behind taking the key, query, and value matrices to be the identity? Such intuition is vital, since the paper attributes the context-scaling capability to attention and task-scaling to the MLP with vectorized data. I wonder if key, query, and value matrices are learnable, w
I found the paper overall to be very well-written and a pleasure to read. The topic is extremely important, particularly in our post-ChatGPT era, and deals with a critical ability in Transformers. The contrast to MLPs sparks a fascinating discussion about the relative merits of different architectures.
I would love to see this manuscript published at ICLR, but there are a few oversights that prevent me from assigning a higher score. If these are able to be addressed, I will be delighted to raise my score. The discussion on context-scaling in MLPs appears to be drawing from prior work by Tong and Pehlevan (https://arxiv.org/abs/2405.15618). The authors claim that MLPs do not context-scale, but Tong and Pehlevan seem to be showing otherwise. I may be misunderstanding both sides here, but Fig 1d
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Anomaly Detection Techniques and Applications
Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Multi-Head Attention · Dropout · Layer Normalization · Linear Warmup With Cosine Annealing · Adam · Attention Dropout
