Scaling Laws and In-Context Learning: A Unified Theoretical Framework
Sushant Mehta, Ishan Gupta

TL;DR
This paper presents a unified theoretical framework linking scaling laws to in-context learning in transformers, revealing how model size and structure influence learning capabilities and phase transitions.
Contribution
It introduces a comprehensive theory connecting scaling laws to ICL emergence, including phase transitions and optimal model size allocations, validated by systematic experiments.
Findings
ICL performance follows power-law scaling in model depth, width, context length, and training data.
Under specific conditions, transformers implement gradient-based meta-learning in their forward pass.
Sharp phase transitions occur at critical model scales.
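As a rough illustration of the power-law finding, the sketch below recovers an exponent by ordinary least squares in log-log space. The functional form error ≈ c · N^(-alpha), the function name `fit_power_law`, and the data points are all assumptions made for illustration; the numbers are synthetic and are not results from the paper.

```python
import math

def fit_power_law(sizes, errors):
    """Fit errors ≈ c * sizes**(-alpha) by least squares on log-transformed data."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of the log-log line
    intercept = mean_y - slope * mean_x    # log of the prefactor c
    return -slope, math.exp(intercept)     # (alpha, c)

# Synthetic data generated from an assumed law with alpha = 0.25, c = 2.0.
sizes = [1e6, 1e7, 1e8, 1e9]
errors = [2.0 * n ** -0.25 for n in sizes]

alpha, c = fit_power_law(sizes, errors)
print(f"alpha = {alpha:.3f}, c = {c:.3f}")
```

On real measurements the points would not lie exactly on a line, and a sharp phase transition would show up as a break in the log-log trend rather than a single slope.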
Abstract
In-context learning (ICL) enables large language models to adapt to new tasks from demonstrations without parameter updates. Despite extensive empirical studies, a principled understanding of ICL emergence at scale remains elusive. We present a unified theoretical framework connecting scaling laws to ICL emergence in transformers. Our analysis establishes that ICL performance follows power-law relationships with model depth, width, context length, and training data, with exponents determined by task structure. We show that under specific conditions, transformers implement gradient-based meta-learning in their forward pass, with an effective learning rate. We demonstrate sharp phase transitions at critical scales and derive optimal depth-width allocations favoring , for the fixed…
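One well-known way to see how a forward pass can act like a gradient step is the linear-regression case: a single gradient-descent step from zero weights gives the same prediction as a softmax-free (linear) attention readout over the demonstrations. The sketch below checks this numerically. It is a minimal illustration under those assumptions and is not claimed to be the paper's construction; the step size `lr` is a hypothetical value, not the effective learning rate mentioned in the abstract.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gd_one_step_prediction(xs, ys, x_query, lr):
    """Predict with the weights obtained after one gradient step from w = 0
    on the loss L(w) = (1 / 2n) * sum_i (w . x_i - y_i)^2."""
    n = len(xs)
    dim = len(x_query)
    # The gradient at w = 0 is -(1/n) * sum_i y_i * x_i,
    # so the updated weights are w1 = (lr / n) * sum_i y_i * x_i.
    w1 = [lr / n * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(dim)]
    return dot(w1, x_query)

def linear_attention_prediction(xs, ys, x_query, lr):
    """Linear attention readout: keys x_i, values y_i, query x_query, no softmax."""
    n = len(xs)
    return lr / n * sum(y * dot(x, x_query) for x, y in zip(xs, ys))

# Synthetic in-context demonstrations from a hypothetical linear task.
random.seed(0)
dim, n_examples, lr = 4, 16, 0.1
w_true = [random.gauss(0, 1) for _ in range(dim)]
xs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_examples)]
ys = [dot(w_true, x) for x in xs]
x_query = [random.gauss(0, 1) for _ in range(dim)]

pred_gd = gd_one_step_prediction(xs, ys, x_query, lr)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr)
print(abs(pred_gd - pred_attn) < 1e-9)
```

The two functions compute the same sum in a different order, so they agree up to floating-point rounding for any choice of `lr` and demonstrations.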
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Multimodal Machine Learning Applications
