In-context learning and Occam's razor
Eric Elmoznino, Tom Marty, Tejas Kasetty, Leo Gagnon, Sarthak Mittal, Mahan Fathi, Dhanya Sridhar, Guillaume Lajoie

TL;DR
This paper connects in-context learning in sequence models to Occam's razor. It shows that minimizing the next-token prediction loss implicitly balances training error against model complexity, and it uses this view to suggest ways of improving in-context learning methods.
Contribution
The paper establishes a theoretical link between in-context learning and data compression, providing a normative framework for in-context learning and identifying limitations of current methods.
Findings
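The next-token prediction loss used to train in-context learners is equivalent to prequential coding, a data compression technique.
Minimizing this loss jointly minimizes training error and model complexity.
Empirical results support the theoretical connection and suggest improvements to current in-context learning methods.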
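The first finding can be made concrete with a small sketch. In prequential coding, each symbol is encoded using a predictor fit only to the symbols that came before it, so the total code length is the sum of the next-symbol log losses over the sequence. The example below is an illustration under stated assumptions, not the paper's setup: the function name `prequential_code_length` is hypothetical, and a Laplace (add-one) Bernoulli estimator stands in for the in-context learner.

```python
import math


def prequential_code_length(bits):
    """Bits needed to encode a binary sequence prequentially.

    Assumption: the "learner" is a Laplace (add-one) Bernoulli estimator,
    a toy stand-in for a sequence model that learns in context.
    """
    ones = 0
    total_bits = 0.0
    for t, x in enumerate(bits):
        # Predict the next symbol from the first t symbols only.
        p_one = (ones + 1) / (t + 2)
        p_observed = p_one if x == 1 else 1.0 - p_one
        # Code length of this symbol equals its next-token log loss (in bits).
        total_bits += -math.log2(p_observed)
        # Update the learner with the symbol just encoded.
        ones += x
    return total_bits


regular = [1] * 16    # highly regular sequence
mixed = [0, 1] * 8    # balanced sequence: this learner finds no exploitable bias

print(prequential_code_length(regular))
print(prequential_code_length(mixed))
```

Under this toy learner, the regular sequence gets a much shorter code than the balanced one: a learner that quickly fits the data pays few bits overall, which is the sense in which cumulative next-token loss acts as a compression objective.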
Abstract
A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best: a principle called Occam's razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam's razor and in-context learning: an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and…
Taxonomy
Topics: Emotions and Moral Behavior · Epistemology, Ethics, and Metaphysics
