MLPs Learn In-Context on Regression and Classification Tasks
William L. Tong, Cengiz Pehlevan

TL;DR
This paper demonstrates that multi-layer perceptrons (MLPs) can learn in-context tasks, perform comparably to Transformers, and even outperform them on certain relational reasoning tasks, challenging the notion that in-context learning is unique to attention-based models.
Contribution
It reveals that MLPs can learn in-context and perform relational reasoning, expanding the understanding of architectures capable of in-context learning beyond Transformers.
Findings
MLPs can learn in-context on synthetic tasks.
MLPs and MLP-Mixer models perform comparably to Transformers under the same compute budget.
MLPs outperform Transformers on classical relational reasoning tasks.
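To make the synthetic setting in the findings above concrete, here is a minimal sketch of how one in-context linear regression example might be built. The function name, the Gaussian sampling, the noiseless labels, and the flattening of all exemplars into a single input vector for an MLP are illustrative assumptions; the summary does not specify the paper's exact setup.

```python
import random


def make_icl_regression_example(n_context=4, dim=2, seed=0):
    """Build one in-context linear regression example (hypothetical helper).

    A fresh weight vector w is drawn for every example, so the mapping
    x -> y has to be inferred from the context pairs rather than stored
    in the model's weights.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def label(x):
        # Noiseless linear label y = w . x (an assumption of this sketch).
        return sum(wi * xi for wi, xi in zip(w, x))

    # Context exemplars: (x_i, y_i) pairs generated by the same w.
    context = []
    for _ in range(n_context):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        context.append((x, label(x)))

    # Query point whose label the model must predict.
    x_query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    y_query = label(x_query)

    # Assumed MLP input format: concatenate every (x_i, y_i) pair,
    # then append the query x, giving one flat vector.
    flat = []
    for x, y in context:
        flat.extend(x)
        flat.append(y)
    flat.extend(x_query)
    return flat, y_query, w


if __name__ == "__main__":
    flat, y_query, _ = make_icl_regression_example()
    print(len(flat), y_query)
```

With `n_context` exemplars of dimension `dim`, the flat input has `n_context * (dim + 1) + dim` entries. A Transformer would more naturally receive the same exemplars as a sequence of tokens instead of one concatenated vector.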
Abstract
In-context learning (ICL), the remarkable ability to solve a task from only input exemplars, is often assumed to be a unique hallmark of Transformer models. By examining commonly employed synthetic ICL tasks, we demonstrate that multi-layer perceptrons (MLPs) can also learn in-context. Moreover, MLPs, and the closely related MLP-Mixer models, learn in-context comparably with Transformers under the same compute budget in this setting. We further show that MLPs outperform Transformers on a series of classical tasks from psychology designed to test relational reasoning, which are closely related to in-context classification. These results underscore a need for studying in-context learning beyond attention-based architectures, while also challenging prior arguments against MLPs' ability to solve relational tasks. Altogether, our results highlight the unexpected competence of MLPs in a…
Peer Reviews
Decision·ICLR 2025 Poster
1. This is the first paper, to my knowledge, that highlights that MLPs alone can exhibit in-context learning. This is an interesting finding since, at least intuitively, the belief is that self-attention helps with ICL. Verification of the in-weight to in-context transition with task diversity, for all architectures, was also an interesting finding.
2. The presentation and discussion of results are clear.
3. For the set of ICL problems considered, the analysis seems quite extensive.
1. The analysis is mostly in stylized and restricted settings. It is not entirely clear what this means for the kinds of ICL observed in realistic settings (this is also mentioned in the limitations section of the paper). Even within simplistic settings, some more complex problems could be considered to support the claim that MLPs are competitive with Transformers. See questions 5 and 6 below.
2. Some useful description of the experimental setup, like input distribution, how MLP and MLP-Mixer w
The paper is quite well written and, in my opinion, easy to follow. The experiments seem well executed and believable; some questions remain, see below.
The paper, in my opinion, overclaims the significance of the work and how surprising the findings are. MLPs are universal function approximators and, of course, can to some extent approximate self-attention layers. It is nevertheless somewhat interesting that gradient descent can install such solutions into architectures consisting purely of MLPs. It is, especially on tractable problems such as linear regression / classification, clear that, if optimized well, neural networks will find / approximate the (
1. Every experiment in the paper is designed thoroughly.
2. This is the first work encountered that explores the ICL capabilities of MLPs, which could be relevant to the literature on foundation models, especially in time series.
3. The addition of relational tasks to the existing synthetic regression and classification experiments contributes valuable insights into Transformer limitations. Transformers perform poorly when test exemplars differ significantly from the training data.
The paper could have included real regression data. Most existing literature focuses on synthetic tasks, and exploring real data (even simple regression datasets) with somewhat complex underlying distributions would have added valuable insights.
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Child and Animal Learning Development · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Average Pooling · Global Average Pooling · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer
