Larger language models do in-context learning differently
Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma

TL;DR
This paper investigates how larger language models perform in-context learning by examining their ability to override semantic priors and to learn input-label mappings. Both abilities emerge primarily with increased scale, while instruction tuning strengthens input-label mapping learning but also strengthens reliance on semantic priors.
Contribution
The study demonstrates that the abilities to override semantic priors and to learn input-label mappings in context emerge mainly in larger models, and that instruction tuning strengthens both the use of semantic priors and the learning of input-label mappings.
Findings
Large models can override semantic priors with contradictory in-context exemplars.
Large models can perform linear classification in semantically-unrelated label settings.
Instruction tuning enhances both semantic prior use and input-label mapping learning.
Abstract
We study how in-context learning (ICL) in language models is affected by semantic priors versus input-label mappings. We investigate two setups (ICL with flipped labels and ICL with semantically-unrelated labels) across various model families (GPT-3, InstructGPT, Codex, PaLM, and Flan-PaLM). First, experiments on ICL with flipped labels show that overriding semantic priors is an emergent ability of model scale. While small language models ignore flipped labels presented in-context and thus rely primarily on semantic priors from pretraining, large models can override semantic priors when presented with in-context exemplars that contradict priors, despite the stronger semantic priors that larger models may hold. We next study semantically-unrelated label ICL (SUL-ICL), in which labels are semantically unrelated to their inputs (e.g., foo/bar instead of negative/positive), thereby forcing…
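The two setups named in the abstract can be illustrated with a small prompt-construction sketch. This is not the authors' code: the exemplar sentences, the "Input:/Label:" template, and the helper names (`flip_labels`, `relabel_unrelated`, `build_prompt`) are hypothetical and chosen only for illustration. The foo/bar label pair is the one the abstract mentions.

```python
import random

# Hypothetical toy sentiment exemplars; NOT data from the paper.
EXEMPLARS = [
    ("a delightful and moving film", "positive"),
    ("a tedious, lifeless mess", "negative"),
    ("sharp writing and great acting", "positive"),
    ("i wanted my two hours back", "negative"),
]

FLIP = {"positive": "negative", "negative": "positive"}
# Semantically-unrelated targets, using the foo/bar example from the abstract.
UNRELATED = {"negative": "foo", "positive": "bar"}


def flip_labels(exemplars, fraction, seed=0):
    """Flipped-label setup: invert the labels of a given fraction of exemplars."""
    rng = random.Random(seed)
    n_flip = round(fraction * len(exemplars))
    flip_idx = set(rng.sample(range(len(exemplars)), n_flip))
    return [
        (text, FLIP[label] if i in flip_idx else label)
        for i, (text, label) in enumerate(exemplars)
    ]


def relabel_unrelated(exemplars):
    """SUL-ICL setup: replace each natural-language label with an unrelated token."""
    return [(text, UNRELATED[label]) for text, label in exemplars]


def build_prompt(exemplars, query):
    """Join exemplars and an unlabeled query into one few-shot prompt (assumed template)."""
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in exemplars]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(build_prompt(flip_labels(EXEMPLARS, 1.0), "an instant classic"))
    print()
    print(build_prompt(relabel_unrelated(EXEMPLARS), "an instant classic"))
```

With all labels flipped, a model that follows the in-context mapping should answer the opposite of its pretrained prior. With foo/bar labels, the prior gives no hint, so the model can succeed only by learning the mapping from the exemplars.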
Peer Reviews
Decision · Submitted to ICLR 2025
The paper is very interesting and addresses hot topics; however, although the experiments are exhaustive, they are not well introduced and could be discussed in more detail.
The paper does not demonstrate scientific novelty. Although the experiments and the (slightly crude) discussion are good, these experiments or something similar have been presented before: https://neurips.cc/virtual/2023/76728. I would be grateful if the authors could highlight the new paper's substantial improvements.
1. This paper is well-written. The main point of the analysis is clear and the experimental design makes sense to me. A large number of experiments are conducted to support the claim that the ability to override the semantic priors acquired during pretraining and to learn new input-label mappings from the context emerges at larger scales. 2. The analysis showing that instruction-tuned models are worse at overriding semantic priors is quite surprising and could infor
1. As mentioned in the limitations, more experiments on generation tasks could make this paper stronger. One possible way would be to insert facts that are wrong or differ from the semantic prior and see whether the model responds based on the newly inserted facts. 2. I’m curious whether the conclusions about model size would still hold for newly trained LLMs with better architectures and training corpora. It is difficult to tell if this behavior is really directly related to the model sizes or if it coul
The study provides a lot of data regarding their research questions. A wide variety of models and data are tested, and the analysis is both focused and rich. Some may say that the results are “obvious” given our experience of ICL, but the study seems to be thorough and if published will serve as evidence for arguments about ICL and scale which are currently made based on anecdotes.
Some in the ICLR community may find the results too obvious to justify giving them space at the conference.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
Methods: Pathways Language Model
