Toward Understanding In-context vs. In-weight Learning
Bryan Chan, Xinyi Chen, András György, Dale Schuurmans

TL;DR
This paper provides a theoretical framework explaining how distributional properties influence the emergence and disappearance of in-context learning in transformers, supported by experiments on simplified models and large language models.
Contribution
It introduces a simplified model with a gating mechanism to analyze in-context versus in-weight learning and extends findings to large language models through fine-tuning experiments.
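As a rough illustration of the kind of gated model described here, the following is a minimal sketch, not the authors' implementation. It assumes dot-product similarity with softmax weights for the in-context predictor (the peer reviews on this page describe it as a convex combination of context labels weighted by input similarity), a linear softmax classifier for the in-weight predictor, and a hard gate. All function names, the `temperature` parameter, and the 0.5 threshold are hypothetical.

```python
import numpy as np


def in_context_predict(x, ctx_x, ctx_y, temperature=1.0):
    """Convex combination of context labels, weighted by input similarity.

    Assumption: similarity is a dot product passed through a softmax.
    ctx_x has shape (n, d); ctx_y holds one-hot labels with shape (n, k).
    """
    scores = ctx_x @ x / temperature
    scores = scores - scores.max()  # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()  # non-negative, sums to 1
    return weights @ ctx_y


def in_weight_predict(x, W):
    """Prediction from fixed parameters only; the context is ignored.

    Assumption: a linear softmax classifier with weights W of shape (k, d).
    """
    logits = W @ x
    logits = logits - logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()


def gated_predict(x, ctx_x, ctx_y, W, gate):
    """Hard gate: use the in-context predictor when gate >= 0.5 (assumed)."""
    if gate >= 0.5:
        return in_context_predict(x, ctx_x, ctx_y)
    return in_weight_predict(x, W)
```

In this sketch the gate is passed in as a number; in a learned model it would itself be a function of the input, trained so that the predictor with the lower expected error is selected.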
Findings
In-context learning emerges under specific distributional conditions.
Fine-tuning can induce in-context and in-weight learning behaviors.
Theoretical predictions align with experimental results on transformers.
Abstract
It has recently been demonstrated empirically that in-context learning emerges in transformers when certain distributional properties are present in the training data, but this ability can also diminish upon further training. We provide a new theoretical understanding of these phenomena by identifying simplified distributional properties that give rise to the emergence and eventual disappearance of in-context learning. We do so by first analyzing a simplified model that uses a gating mechanism to choose between an in-weight and an in-context predictor. Through a combination of a generalization error and regret analysis we identify conditions where in-context and in-weight learning emerge. These theoretical findings are then corroborated experimentally by comparing the behaviour of a full transformer on the simplified distributions to that of the stylized model, demonstrating aligned…
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper introduces a novel theoretical framework that investigates how a model selects between ICL and IWL. This framework offers insights into the conditions under which ICL emerges, especially highlighting the role of rare labels in promoting ICL.
2. The writing and presentation are clear.
3. The paper includes experiments on both synthetic and natural few-shot learning datasets that validate its findings.
1. This paper assumes a simple switch between IWL and ICL based on error minimization, but in real-world LLMs the transition may not be instantaneous and may involve more complexities.
2. The in-context predictor is modeled as a convex combination of the labels in the context, weighted by input similarity. This raises questions because, more commonly, the in-context predictor should learn the relationship between inputs and outputs (i.e., the mapping from x to y) rather than combining labels. This
- Reasonable theoretical foundation for their arguments
- Well-designed experiments to test these theoretical predictions
Overall I think this has the makings of a good paper that makes a significant and original contribution to an important thread of research in the science of transformer generalization.
- At the moment I am not convinced that the experiments reported in Fig 2 really do support the theoretical predictions. I hope this is just my misunderstanding and that the authors can explain what I missed, and perhaps provide a clearer explanation in the paper.
- While I am pretty well-versed in this corner of the literature and so prepared to care about the contents of Fig 2, it still took me a while to process all the acronyms and setup.
The paper addresses an important topic in the science of deep learning, namely the phenomena of the emergence and transience of in-context learning (versus in-weight learning). On this topic the paper presents an interesting learning setting involving an in-context supervised classification task that captures several important elements of in-context learning. At the same time, this setting is simpler in some ways than in-context linear regression or other comparable synthetic learning settings.
**W1. Mismatch with Singh et al.'s setting.** In the introduction you write that your analysis "explains the findings of Singh et al. (2023)," namely the transience of in-context learning, giving way to in-weight learning. It doesn't seem to me that your model captures the phenomenon observed by Singh et al. In my understanding, a fundamental feature of the setting of Singh et al. is that there is no train-loss incentive for an ICL solution over an IWL solution, because both paradigms are equall
