MLPs Learn In-Context on Regression and Classification Tasks

William L. Tong; Cengiz Pehlevan

arXiv:2405.15618 · cs.LG · February 26, 2025 · 2 cites

TL;DR

This paper demonstrates that multi-layer perceptrons (MLPs) can learn in-context tasks, perform comparably to Transformers, and even outperform them on certain relational reasoning tasks, challenging the notion that in-context learning is unique to attention-based models.

Contribution

It reveals that MLPs can learn in-context and perform relational reasoning, expanding the understanding of architectures capable of in-context learning beyond Transformers.

Findings

01. MLPs can learn in-context on synthetic tasks.

02. MLPs and MLP-Mixer models perform comparably to Transformers under the same compute budget.

03. MLPs outperform Transformers on classical relational reasoning tasks.

Abstract

In-context learning (ICL), the remarkable ability to solve a task from only input exemplars, is often assumed to be a unique hallmark of Transformer models. By examining commonly employed synthetic ICL tasks, we demonstrate that multi-layer perceptrons (MLPs) can also learn in-context. Moreover, MLPs, and the closely related MLP-Mixer models, learn in-context comparably with Transformers under the same compute budget in this setting. We further show that MLPs outperform Transformers on a series of classical tasks from psychology designed to test relational reasoning, which are closely related to in-context classification. These results underscore a need for studying in-context learning beyond attention-based architectures, while also challenging prior arguments against MLPs' ability to solve relational tasks. Altogether, our results highlight the unexpected competence of MLPs in a…
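The abstract does not spell out the task format, so the following is only an illustrative sketch of one common synthetic in-context regression setup, not the authors' code. It assumes each task draws a fresh hidden weight vector, that inputs are standard Gaussian, and that an MLP would receive the context exemplars and the query flattened into a single vector. The function names (`sample_icl_regression`, `least_squares_predict`) and all parameters are hypothetical. A per-task least-squares fit stands in as a reference solution that uses only the context.

```python
import numpy as np


def sample_icl_regression(batch, n_points, dim, noise_std=0.0, seed=0):
    """Sample in-context linear regression tasks (assumed setup).

    Every task has its own weight vector, so the query label can only be
    predicted by inferring the mapping from the context exemplars.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(batch, dim))                # hidden weights, one per task
    x = rng.normal(size=(batch, n_points + 1, dim))  # context inputs plus one query
    y = np.einsum("bnd,bd->bn", x, w)                # noiseless labels
    y = y + noise_std * rng.normal(size=y.shape)     # optional label noise

    # Context exemplars are (x_i, y_i) pairs; the final point is the query.
    context = np.concatenate([x[:, :-1], y[:, :-1, None]], axis=-1)
    query = x[:, -1]
    target = y[:, -1]

    # An MLP has no sequence axis, so the whole prompt is flattened
    # into one vector of length n_points * (dim + 1) + dim.
    flat = np.concatenate([context.reshape(batch, -1), query], axis=-1)
    return flat, context, query, target


def least_squares_predict(context, query):
    """Reference predictor: fit least squares on each task's context."""
    preds = []
    for ctx, q in zip(context, query):
        xs, ys = ctx[:, :-1], ctx[:, -1]
        w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
        preds.append(q @ w_hat)
    return np.array(preds)


if __name__ == "__main__":
    flat, context, query, target = sample_icl_regression(batch=4, n_points=8, dim=3)
    print("MLP input shape:", flat.shape)
    err = np.abs(least_squares_predict(context, query) - target).max()
    print("max reference error:", err)
```

In this sketch a model trained on `flat` to predict `target` would be compared against the least-squares reference; with no label noise and at least `dim` context points, the reference recovers the query label up to numerical precision.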

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. This is the first paper, to my knowledge, to highlight that MLPs alone can exhibit in-context learning. This is an interesting finding since, at least intuitively, the belief is that self-attention helps with ICL. The verification of the in-weight to in-context transition with task diversity, for all architectures, was also an interesting finding.
2. The presentation and discussion of results are clear.
3. For the set of ICL problems considered, the analysis seems quite extensive.

Weaknesses

1. The analysis is mostly in stylized and restricted settings. It is not entirely clear what this means for the kinds of ICL observed in realistic settings (this is also mentioned in the limitations section of the paper). Even within simplistic settings, some more complex problems could be considered to support the claim that MLPs are competitive with Transformers. See questions 5 and 6 below.
2. Some useful description of the experimental setup, like input distribution, how MLP and MLP-Mixer w

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The paper is quite well written and, in my opinion, easy to follow. The experiments seem well executed and believable; some questions remain, see below.

Weaknesses

The paper, in my opinion, overclaims the significance of the work and how surprising the findings are. MLPs are universal function approximators and, of course, can to some extent approximate self-attention layers. It is nevertheless somewhat interesting that gradient descent can install such solutions into architectures consisting purely of MLPs. It is clear, especially on tractable problems such as linear regression / classification, that, if optimized well, neural networks will find / approximate the (

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. Every experiment in the paper is designed thoroughly.
2. This is the first work encountered that explores the ICL capabilities of MLPs, which could be relevant to the literature on foundation models, especially in time series.
3. The addition of relational tasks to the existing synthetic regression and classification experiments contributes valuable insights into Transformer limitations. Transformers perform poorly when test exemplars differ significantly from the training data.

Weaknesses

The paper could have included real regression data. Most existing literature focuses on synthetic tasks, and exploring real data (even simple regression datasets) with somewhat complex underlying distributions would have added valuable insights.

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Child and Animal Learning Development · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Average Pooling · Global Average Pooling · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer