Larger language models do in-context learning differently

Jerry Wei; Jason Wei; Yi Tay; Dustin Tran; Albert Webson; Yifeng Lu; Xinyun Chen; Hanxiao Liu; Da Huang; Denny Zhou; Tengyu Ma

arXiv:2303.03846·cs.CL·March 9, 2023·99 cites

Open Access 3 Reviews

TL;DR

This paper investigates how larger language models perform in-context learning by examining their ability to override semantic priors and learn input-label mappings, revealing that these capabilities emerge primarily with increased scale and are enhanced by instruction tuning.

Contribution

The study demonstrates that the ability to override semantic priors and learn input-label mappings in in-context learning emerges mainly with larger models and is improved by instruction tuning.

Findings

01. Large models can override semantic priors with contradictory in-context exemplars.

02. Large models can perform linear classification in semantically-unrelated label settings.

03. Instruction tuning enhances both semantic prior use and input-label mapping learning.

Abstract

We study how in-context learning (ICL) in language models is affected by semantic priors versus input-label mappings. We investigate two setups (ICL with flipped labels and ICL with semantically-unrelated labels) across various model families (GPT-3, InstructGPT, Codex, PaLM, and Flan-PaLM). First, experiments on ICL with flipped labels show that overriding semantic priors is an emergent ability of model scale. While small language models ignore flipped labels presented in-context and thus rely primarily on semantic priors from pretraining, large models can override semantic priors when presented with in-context exemplars that contradict priors, despite the stronger semantic priors that larger models may hold. We next study semantically-unrelated label ICL (SUL-ICL), in which labels are semantically unrelated to their inputs (e.g., foo/bar instead of negative/positive), thereby forcing…
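The two setups named in the abstract can be illustrated with a minimal sketch of how such prompts might be assembled. The "Input:/Output:" template, the example sentences, and the helper name `build_icl_prompt` are assumptions made for illustration and are not taken from the paper; only the label swaps (flipped negative/positive, and foo/bar in place of negative/positive) come from the abstract.

```python
def build_icl_prompt(exemplars, query, label_map):
    """Format (text, label) exemplars plus a query into a few-shot prompt.

    label_map rewrites each gold label before it is shown in context.
    The Input/Output template is a hypothetical choice for this sketch.
    """
    blocks = []
    for text, label in exemplars:
        blocks.append(f"Input: {text}\nOutput: {label_map[label]}\n")
    # The query is left open so the model must produce the label.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n".join(blocks)


# Flipped labels: the in-context label contradicts the semantic prior.
FLIPPED = {"negative": "positive", "positive": "negative"}
# Semantically-unrelated labels (SUL-ICL): labels carry no sentiment meaning.
UNRELATED = {"negative": "foo", "positive": "bar"}

# Made-up sentiment exemplars, for illustration only.
exemplars = [
    ("A moving, beautifully shot film.", "positive"),
    ("Dull and far too long.", "negative"),
]
query = "An instant classic."

flipped_prompt = build_icl_prompt(exemplars, query, FLIPPED)
sul_prompt = build_icl_prompt(exemplars, query, UNRELATED)
```

In the flipped prompt, a model that follows the in-context mapping should answer "negative" for a positive query, while a model relying on its semantic prior should answer "positive". In the SUL-ICL prompt the prior offers no help, so the mapping can only be recovered from the exemplars.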

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 1 · Confidence 5

Strengths

The paper is very interesting and addresses hot topics; however, although the experiments are exhaustive, they are not well introduced and could be discussed in more detail.

Weaknesses

The paper does not demonstrate scientific novelty. Although the experiments and the (slightly crude) discussion are good, these experiments, or something similar, have already been presented here: https://neurips.cc/virtual/2023/76728. I would be grateful if the authors could highlight the new paper's substantial improvements.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. This paper is well-written. The main point of the analysis is clear and the experimental design makes sense to me. A large number of experiments are conducted to support the point that the ability to override the semantic prior acquired during pretraining and to learn new input-label mappings from the context emerges at larger scales.

2. The finding that instruction-tuned models are worse at overriding the semantic priors is quite surprising and could infor

Weaknesses

1. As mentioned in the limitations, more experiments on generation tasks could make this paper stronger. One possible way would be to insert facts that are wrong or that differ from the semantic prior and see whether the model responds based on the newly inserted facts.

2. I’m curious whether the conclusions about model size would still hold for newly trained LLMs with better structures and training corpora. It is difficult to tell if this behavior is really directly related to the model sizes or if it coul

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The study provides a lot of data regarding their research questions. A wide variety of models and data are tested, and the analysis is both focused and rich. Some may say that the results are “obvious” given our experience of ICL, but the study seems to be thorough and if published will serve as evidence for arguments about ICL and scale which are currently made based on anecdotes.

Weaknesses

Some in the ICLR community may find the results too obvious to justify giving them space at the conference.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Pathways Language Model