Bayesian scaling laws for in-context learning

Aryaman Arora; Dan Jurafsky; Christopher Potts; Noah D. Goodman

arXiv:2410.16531·cs.CL·September 23, 2025

TL;DR

This paper takes a Bayesian perspective on in-context learning (ICL): it argues that ICL approximates Bayesian inference and derives a new scaling law from that view. Experiments on GPT-2 models test the law's predictive power, and the authors discuss its implications for safety alignment.

Contribution

It presents a novel Bayesian scaling law for ICL that interprets the process as Bayesian inference, providing insights into task priors and learning efficiency.

Findings

01

Bayesian scaling law accurately predicts ICL behavior.

02

The scaling law matches existing scaling laws in accuracy.

03

Insights into safety and capability reemergence in LLMs.

Abstract

In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates. Prior work has established strong correlations between the number of in-context examples provided and the accuracy of the model's predictions. In this paper, we seek to explain this correlation by showing that ICL approximates a Bayesian learner. This perspective gives rise to a novel Bayesian scaling law for ICL. In experiments with GPT-2 models of different sizes, our scaling law matches existing scaling laws in accuracy while also offering interpretable terms for task priors, learning efficiency, and per-example probabilities. To illustrate the analytic power that such interpretable scaling laws provide, we report on controlled synthetic dataset experiments designed to inform real-world studies of safety alignment. In our experimental protocol, we…
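
The idea of a Bayesian learner whose predictions improve with more in-context examples can be illustrated with a small sketch. This is a generic Bayesian update over a handful of candidate tasks, not necessarily the paper's exact functional form. The names `prior`, `likelihoods`, `posterior`, and `predictive` are hypothetical, and the sketch assumes each task assigns the same fixed probability to every in-context example.

```python
def posterior(prior, likelihoods, n):
    """Posterior over candidate tasks after n in-context examples.

    prior[m]       -- prior probability of task m (assumed, illustrative)
    likelihoods[m] -- probability task m assigns to each observed example
                      (assumed constant across examples for simplicity)
    """
    # Bayes' rule: posterior is proportional to prior times likelihood of n examples.
    weights = [p * (q ** n) for p, q in zip(prior, likelihoods)]
    total = sum(weights)
    return [w / total for w in weights]


def predictive(prior, likelihoods, n):
    """Expected probability of the next example after n in-context examples."""
    post = posterior(prior, likelihoods, n)
    return sum(w * q for w, q in zip(post, likelihoods))


# Two hypothetical tasks: the one that explains the demonstrations well
# (per-example probability 0.8) starts with a low prior of 0.1.
prior = [0.9, 0.1]
likelihoods = [0.2, 0.8]

curve = [predictive(prior, likelihoods, n) for n in range(9)]
for n, p in enumerate(curve):
    print(f"n={n}: p(next example) = {p:.4f}")
```

In this toy setting the predicted probability rises with the number of examples and saturates at the per-example probability of the best-matching task. The prior only controls how many examples it takes to get there, so a task with a low prior is still recovered once enough demonstrations are given.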

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01·Rating 8·Confidence 3

Strengths

1. This work investigates an interesting problem. Based on some assumptions, it proposes a new theory and validates this theory with experiments. The work also does a good job comparing the proposed scaling law to prior works.
2. I appreciate that the experiments include both toy settings and real-world LLMs.

Weaknesses

The paper primarily focuses on transformers. However, it would be interesting to see experiments on other architectures, e.g., state space models, and whether ICL is Bayesian in these models as well.

Reviewer 02·Rating 5·Confidence 2

Strengths

Overall, this is a very interesting paper, as the Bayesian scaling law offers insights into phenomena such as many-shot prompt jailbreaking. If the conclusions of this work hold, they could strengthen the argument that ICL operates in a Bayesian manner.

Weaknesses

My primary concern lies with the functional form of the posterior expectation in Equation (2). Specifically, in Equation (2), over which random variables is the expectation taken? The notation used here is somewhat unclear to me. Additionally, in Equation (17) of the appendix, the authors appear to apply the linearity of expectation, which seems to assume that E(A/B)=E(A)/E(B). Am I missing something here?
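
The reviewer's point that the expectation of a ratio is not, in general, the ratio of expectations can be checked with a two-outcome example. The sketch below is a generic counterexample using exact rational arithmetic. It does not reproduce the paper's Equation (17), and the variables `outcomes`, `mean_of_ratio`, and `ratio_of_means` are illustrative.

```python
from fractions import Fraction

# Two equally likely joint outcomes (A, B): A is always 1, B is 1 or 2.
outcomes = [(Fraction(1), Fraction(1)), (Fraction(1), Fraction(2))]
prob = Fraction(1, len(outcomes))


def expectation(f):
    """Expectation of f(A, B) under the uniform distribution over outcomes."""
    return sum(prob * f(a, b) for a, b in outcomes)


mean_of_ratio = expectation(lambda a, b: a / b)  # E[A/B]
ratio_of_means = expectation(lambda a, b: a) / expectation(lambda a, b: b)  # E[A]/E[B]

print("E[A/B]    =", mean_of_ratio)   # 3/4
print("E[A]/E[B] =", ratio_of_means)  # 2/3
```

Linearity of expectation covers sums and constant multiples only, so replacing E(A/B) with E(A)/E(B) needs a separate justification, such as a constant denominator or an explicitly stated approximation.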

Reviewer 03·Rating 6·Confidence 4

Strengths

Overall, the paper is well motivated, well written, and clear. The experiments all ask very important questions. Here is a list of strengths:
Contributions:
C1: The result clearly separating SFT’s effect on the prior vs in each distribution is interesting and useful to understand how ICL abilities are affected by SFT.
C2: SFT being increasingly superficial (prior change only) on larger models is interesting, and suggests an important phenomenon to study as we scale models.
C3: In gener

Weaknesses

Even though the paper is great at a conceptual level, unfortunately, many claims are weakly justified. I think the paper has enough qualitative contributions and does not require a justification of competence by scaling law “benchmarking”. In fact, I don’t think the paper’s contribution is degraded at all even if the fits went worse, as the importance of this paper, for me, is that the authors developed a framework to fit a scaling law and interpret its terms, allowing qualitative insights of ho

Code & Models

Repositories

aryamanarora/bayesian-laws-icl
PyTorch·Official

Taxonomy

Topics·Anomaly Detection Techniques and Applications

Methods·Shrink and Fine-Tune