On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model
Seongjin Shin, Sang-Woo Lee, Hwijeen Ahn, Sungdong Kim, HyoungSeok Kim, Boseop Kim, Kyunghyun Cho, Gichang Lee, Woomyoung Park, Jung-Woo Ha, Nako Sung

TL;DR
This paper analyzes how the source and size of pretraining data affect in-context learning in large language models, revealing that domain relevance and data combination influence performance more than corpus size or perplexity.
Contribution
It provides an in-depth analysis of pretraining corpus effects on in-context learning, highlighting factors beyond size and perplexity that impact performance.
Findings
In-context learning depends heavily on corpus domain source.
Combining multiple corpora can enable in-context learning even if individual corpora do not.
Pretraining on task-related data does not always improve in-context learning for that task.
Abstract
Many recent studies on large-scale language models have reported successful in-context zero- and few-shot learning ability. However, the in-depth analysis of when in-context learning occurs is still lacking. For example, it is unknown how in-context learning performance changes as the training corpus varies. Here, we investigate the effects of the source and size of the pretraining corpus on in-context learning in HyperCLOVA, a Korean-centric GPT-3 model. From our in-depth investigation, we introduce the following observations: (1) in-context learning performance heavily depends on the corpus domain source, and the size of the pretraining corpus does not necessarily determine the emergence of in-context learning, (2) in-context learning ability can emerge when a language model is trained on a combination of multiple corpora, even when each corpus does not result in in-context learning…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Adam · Attention Dropout · Layer Normalization · Linear Warmup With Cosine Annealing · Multi-Head Attention
