Bayesian scaling laws for in-context learning
Aryaman Arora, Dan Jurafsky, Christopher Potts, Noah D. Goodman

TL;DR
This paper introduces a Bayesian perspective on in-context learning (ICL) and derives a new scaling law from the view that ICL approximates Bayesian inference. Experiments on GPT-2 models validate the law's predictive power and illustrate its implications for safety alignment.
Contribution
It presents a novel Bayesian scaling law for ICL that interprets the process as Bayesian inference, providing insights into task priors and learning efficiency.
Findings
The Bayesian scaling law accurately predicts ICL behavior.
The law matches existing scaling laws in accuracy.
The law offers insights into safety alignment and capability reemergence in LLMs.
Abstract
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates. Prior work has established strong correlations between the number of in-context examples provided and the accuracy of the model's predictions. In this paper, we seek to explain this correlation by showing that ICL approximates a Bayesian learner. This perspective gives rise to a novel Bayesian scaling law for ICL. In experiments with GPT-2 models of different sizes, our scaling law matches existing scaling laws in accuracy while also offering interpretable terms for task priors, learning efficiency, and per-example probabilities. To illustrate the analytic power that such interpretable scaling laws provide, we report on controlled synthetic dataset experiments designed to inform real-world studies of safety alignment. In our experimental protocol, we…
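The idea that a Bayesian learner's predictions improve with more in-context examples can be sketched with a toy model. The code below is a minimal illustration under stated assumptions, not the paper's fitted scaling law: it assumes a finite set of candidate tasks, a prior weight per task, and a constant probability that each task assigns to one example from the true task. The function name `bayesian_icl_curve` and all parameter values are hypothetical.

```python
import math

def bayesian_icl_curve(prior, per_example_prob, max_shots):
    """Posterior-predictive probability of the next example for n = 0..max_shots.

    prior[m]            : prior weight of candidate task m (assumed to sum to 1)
    per_example_prob[m] : probability task m assigns to a single example from
                          the true task (assumed constant across examples)
    """
    curve = []
    for n in range(max_shots + 1):
        # Log of the unnormalized posterior after n i.i.d. examples:
        # log prior + n * log likelihood of one example.
        log_post = [math.log(w) + n * math.log(p)
                    for w, p in zip(prior, per_example_prob)]
        top = max(log_post)  # subtract the max for numerical stability
        weights = [math.exp(lp - top) for lp in log_post]
        total = sum(weights)
        # Predictive probability of example n+1: posterior-weighted average.
        curve.append(sum(w / total * p
                         for w, p in zip(weights, per_example_prob)))
    return curve

# Hypothetical setup: the prior favors task 0, but the examples are far
# more likely under task 1, so the posterior shifts as shots accumulate.
curve = bayesian_icl_curve(prior=[0.9, 0.1],
                           per_example_prob=[0.2, 0.8],
                           max_shots=20)
for n in (0, 1, 5, 20):
    print(n, round(curve[n], 4))
```

In this toy setting the predictive probability starts at the prior-weighted average and rises toward the probability assigned by the best-matching task, which is the qualitative shape a Bayesian account of ICL would predict.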
Peer Reviews
Decision: Submitted to ICLR 2025
1. This work investigates an interesting problem. Based on some assumptions, it proposes a new theory and validates this theory with experiments. The work also does a good job comparing the proposed scaling law to prior works. 2. I appreciate that the experiments include both toy settings as well as real world LLMs.
The paper primarily focuses on transformers. However, it would be interesting to see experiments on other architectures, e.g., state space models, and whether ICL is Bayesian in these models as well.
Overall, this is a very interesting paper, as the Bayesian scaling law offers insights into phenomena such as many-shot prompt jailbreaking. If the conclusions of this work hold, they could strengthen the argument that ICL operates in a Bayesian manner.
My primary concern lies with the functional form of the posterior expectation in Equation (2). Specifically, in Equation (2), over which random variables is the expectation taken? The notation used here is somewhat unclear to me. Additionally, in Equation (17) of the appendix, the authors appear to apply the linearity of expectation, which seems to assume that E(A/B)=E(A)/E(B). Am I missing something here?
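The general point behind this question, that the expectation of a ratio need not equal the ratio of expectations, can be checked with a small example. The sketch below uses a hypothetical toy case (A constant, B uniform on {1, 2}) and says nothing about what the appendix derivation actually does.

```python
from fractions import Fraction

# Hypothetical toy case: A is the constant 1, B is uniform on {1, 2}.
a = Fraction(1)
b_values = [Fraction(1), Fraction(2)]

# E[A/B]: average the ratio over the outcomes of B.
mean_of_ratio = sum(a / b for b in b_values) / len(b_values)

# E[A]/E[B]: divide the mean of A by the mean of B.
ratio_of_means = a / (sum(b_values) / len(b_values))

print(mean_of_ratio, ratio_of_means)
```

Linearity of expectation covers sums and constant multiples only, so swapping an expectation with a ratio needs a separate argument, such as a constant denominator or an explicitly stated approximation.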
Overall, the paper is well motivated, well written and clear. The experiments are all asking very important questions. Here is a list of strengths: Contributions: C1: The result clearly separating SFT’s effect on the prior vs in each distribution is interesting and useful for understanding how ICL abilities are affected by SFT. C2: SFT being increasingly superficial (prior change only) on larger models is interesting, and suggests an important phenomenon to study as we scale models. C3: In gener
Even though the paper is great at a conceptual level, unfortunately, many claims are weakly justified. I think the paper has enough qualitative contributions and does not require a justification of competence by scaling law “benchmarking”. In fact, I don’t think the paper’s contribution is degraded at all even if the fits went worse, as the importance of this paper, for me, is that the authors developed a framework to fit a scaling law and interpret its terms, allowing qualitative insights of ho
Taxonomy
Topics: Anomaly Detection Techniques and Applications
Methods: Shrink and Fine-Tune
