On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model

Seongjin Shin; Sang-Woo Lee; Hwijeen Ahn; Sungdong Kim; HyoungSeok Kim; Boseop Kim; Kyunghyun Cho; Gichang Lee; Woomyoung Park; Jung-Woo Ha; Nako Sung

arXiv:2204.13509 · cs.CL · May 10, 2022

TL;DR

This paper analyzes how the source and size of the pretraining corpus affect in-context learning in HyperCLOVA, a Korean-centric GPT-3 model, and finds that the corpus domain source and the combination of corpora matter more than corpus size or perplexity.

Contribution

The paper provides an in-depth analysis of how the pretraining corpus affects in-context learning, highlighting factors beyond corpus size and perplexity, such as domain source and corpus combination, that affect performance.

Findings

01

In-context learning depends heavily on the domain source of the pretraining corpus.

02

Combining multiple corpora can enable in-context learning even if individual corpora do not.

03

Pretraining on task-related data does not always improve in-context learning for that task.

Abstract

Many recent studies on large-scale language models have reported successful in-context zero- and few-shot learning ability. However, the in-depth analysis of when in-context learning occurs is still lacking. For example, it is unknown how in-context learning performance changes as the training corpus varies. Here, we investigate the effects of the source and size of the pretraining corpus on in-context learning in HyperCLOVA, a Korean-centric GPT-3 model. From our in-depth investigation, we introduce the following observations: (1) in-context learning performance heavily depends on the corpus domain source, and the size of the pretraining corpus does not necessarily determine the emergence of in-context learning, (2) in-context learning ability can emerge when a language model is trained on a combination of multiple corpora, even when each corpus does not result in in-context learning…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Adam · Attention Dropout · Layer Normalization · Linear Warmup With Cosine Annealing · Multi-Head Attention