General-Purpose In-Context Learning by Meta-Learning Transformers
Louis Kirsch, James Harrison, Jascha Sohl-Dickstein, Luke Metz

TL;DR
This paper demonstrates that Transformers can be meta-trained to serve as general-purpose in-context learners that adapt to diverse tasks without an explicitly defined inference model, training loss, or optimization algorithm. It also characterizes when the learned algorithms memorize versus generalize and examines how the meta-training setup affects that outcome.
Contribution
It introduces a method to meta-train Transformers as general-purpose in-context learners and analyzes the factors affecting their generalization and memorization capabilities.
Findings
Transformers can be meta-trained to perform in-context learning across various tasks.
Model size, the number of meta-training tasks, and meta-optimization influence whether the learned algorithm memorizes or generalizes.
The size of the accessible memory (state), not just the parameter count, is a bottleneck for meta-trained models.
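The difference between state size and parameter count can be illustrated with a small calculation. The following is a minimal sketch, not the authors' definition: it assumes that the "state" of an LSTM is its hidden and cell vectors, and that the "state" of a Transformer is its cached keys and values over the context. All function names are hypothetical.

```python
def lstm_state_size(hidden_size, num_layers=1):
    """Assumed state of an LSTM: one hidden and one cell vector per layer."""
    return 2 * hidden_size * num_layers


def lstm_param_count(input_size, hidden_size):
    """Parameters of a single LSTM layer: four gates, each with input
    weights, recurrent weights, and one bias vector (some libraries
    keep two bias vectors per gate)."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def transformer_state_size(seq_len, d_model, num_layers):
    """Assumed state of a Transformer: cached keys and values for every
    context position in every layer."""
    return 2 * seq_len * d_model * num_layers


# The LSTM state is fixed, while the assumed Transformer state grows
# with the number of context examples.
print(lstm_state_size(256), lstm_param_count(64, 256))
print(transformer_state_size(seq_len=100, d_model=256, num_layers=4))
```

Under these assumptions, an LSTM's parameter count grows quadratically with its hidden size while its state grows only linearly, which is one way to read the claim that state and parameter count are separate quantities.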
Abstract
Modern machine learning requires system designers to specify aspects of the learning pipeline, such as losses, architectures, and optimizers. Meta-learning, or learning-to-learn, instead aims to learn those aspects, and promises to unlock greater capabilities with less manual effort. One particularly ambitious goal of meta-learning is to train general-purpose in-context learning algorithms from scratch, using only black-box models with minimal inductive bias. Such a model takes in training data, and produces test-set predictions across a wide range of problems, without any explicit definition of an inference model, training loss, or optimization algorithm. In this paper we show that Transformers and other black-box models can be meta-trained to act as general-purpose in-context learners. We characterize transitions between algorithms that generalize, algorithms that memorize, and…
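The interface described in the abstract, training data in and test-set predictions out, can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: the task generator (a random linear projection of the inputs plus a label permutation), the token layout, and the names `sample_task`, `apply_task`, and `build_context` are all hypothetical.

```python
import random


def sample_task(num_features, num_classes, rng):
    """Sample a hypothetical task: a random linear projection of the
    inputs together with a random permutation of the class labels."""
    scale = 1.0 / num_features ** 0.5
    proj = [[rng.gauss(0.0, scale) for _ in range(num_features)]
            for _ in range(num_features)]
    perm = list(range(num_classes))
    rng.shuffle(perm)
    return proj, perm


def apply_task(x, y, task):
    """Transform one (input, label) pair according to the sampled task."""
    proj, perm = task
    x_new = [sum(w * xi for w, xi in zip(row, x)) for row in proj]
    return x_new, perm[y]


def build_context(examples, query_x, num_classes):
    """Build the sequence fed to a black-box sequence model: each labelled
    example becomes input features followed by a one-hot label, and the
    query carries an all-zero label slot for the model to predict."""
    tokens = []
    for x, y in examples:
        one_hot = [1.0 if c == y else 0.0 for c in range(num_classes)]
        tokens.append(list(x) + one_hot)
    tokens.append(list(query_x) + [0.0] * num_classes)
    return tokens


rng = random.Random(0)
task = sample_task(num_features=4, num_classes=3, rng=rng)
support = [apply_task([0.1, 0.2, 0.3, 0.4], 1, task),
           apply_task([0.4, 0.3, 0.2, 0.1], 0, task)]
query_x, _ = apply_task([0.2, 0.2, 0.2, 0.2], 2, task)
sequence = build_context(support, query_x, num_classes=3)
print(len(sequence), len(sequence[0]))
```

In this sketch no loss function or optimizer is applied at test time; any adaptation would have to happen inside the sequence model as it reads the context.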
Peer Reviews
Decision·Submitted to ICLR 2024
1. This paper performed experiments on image classification datasets to demonstrate that transformers can be meta-trained to perform in-context learning.
2. Figure 2 gives convincing evidence of a transition from memorization to generalization induced by model capacity and sample size.
3. This paper provides practical interventions to improve meta-training.
1. The writing is not completely clear. For example, "general-purpose in-context learning" is a vague term without a rigorous mathematical definition. This makes the paper a bit hard to read.
2. The memory or state in Section 4.2 is quite heuristic without a concrete math definition. Beyond LSTM and transformers, it is not clear how the state is defined. The insight that "Large state is more crucial than parameter count" is thus not fully grounded. The contributions of this paper were signifi
- Presents a simple baseline model (GPICL) for meta-learning general purpose learners with minimal inductive bias. Shows competitive performance compared to models with stronger inductive biases.
- Provides interesting insights into the transitions from memorization to task identification to general learning as model size and number of tasks increase during meta-training. Identifies the accessible state/memory size as a key bottleneck for meta-learning capabilities, rather than just model param
- The authors use CIFAR10, MNIST, FashionMNIST and SVHN as their datasets. Those are rather simple datasets, and it would be good to see whether the findings generalize to harder and larger datasets. Most importantly, it would be interesting to show that the method performs well due to its inherent ability to learn rather than because the datasets are easy.
- The authors do not make it explicitly clear which elements of their new setup are their contribution and which are already present in other papers.
Taxonomy
Topics: Machine Learning and Data Classification · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
