Task Diversity Shortens the ICL Plateau

Jaeyeon Kim; Sehyun Kwon; Joo Young Choi; Jongho Park; Jaewoong Cho; Jason D. Lee; Ernest K. Ryu

arXiv:2410.05448·cs.LG·August 13, 2025

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper shows that training on multiple diverse in-context learning (ICL) tasks simultaneously shortens the long loss plateaus observed in ICL training, making each task easier to learn. The authors suggest this may partly explain the success of large-scale language model training.

Contribution

The paper shows that training on multiple diverse ICL tasks at once shortens loss plateaus, contradicting the intuition that the combined complexity of multiple tasks would lengthen the learning process.

Findings

01 · Diverse ICL tasks shorten the loss plateau duration.

02 · Training on multiple tasks improves learning efficiency.

03 · Task diversity may contribute to large-scale model success.

Abstract

In-context learning (ICL) describes a language model's ability to generate outputs based on a set of input demonstrations and a subsequent query. To understand this remarkable capability, researchers have studied simplified, stylized models. These studies have consistently observed long loss plateaus, during which models exhibit minimal improvement, followed by a sudden, rapid surge of learning. In this work, we reveal that training on multiple diverse ICL tasks simultaneously shortens the loss plateaus, making each task easier to learn. This finding is surprising as it contradicts the natural intuition that the combined complexity of multiple ICL tasks would lengthen the learning process, not shorten it. Our result suggests that the recent success in large-scale training of language models may be attributed not only to the richness of the data at scale but also to the easier…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

The paper showcases a good number of experiments with solid results. I also appreciate the explanation of the experimental setup, and detailed experiments with multiple tasks and models. The authors also present a good analysis of some possible explanations for the results achieved. The presentation is great, and so is the writing.

Weaknesses

The most significant weakness of the work is the construction of the single-task baselines. In my opinion, this issue alone should justify the rejection of the paper. In short: **the experiments compare single-task models trained with batch size B against multi-task models (with k tasks) trained with batch size kB, that is, a k times larger batch size and a k times larger training set and compute cost!** Then, the authors use the total number of iterations (batches) to compare models. While this is mentioned in the

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The manuscript is very well written; the division of the paper makes it very intuitive to follow and the plots/tables are informative. The main findings are also quite interesting in how they present a setting where (if all holds) learning various tasks at once can potentially improve the efficiency of learning a marginally or completely unrelated task. The notion of measuring how the loss plateau contracts is a rather novel way of investigation and I see avenues where it could be useful for e

Weaknesses

Some points made by the authors do not have completely solid grounding in the results or lack some support. For example, on L16 and L199-202, the authors state that "multi-task ICL is easier to learn than single-task ICL is surprising as it contradicts the natural intuition that the combined complexity of multiple ICL tasks would lengthen the learning process". I'm not particularly convinced of this notion, at least in the regime within which the authors present their results. The sample is bro

Reviewer 03 · Rating 3 · Confidence 4

Strengths

This paper is clearly written. The figures and captions are easy to follow, and the motivation is clear.

Weaknesses

The primary conclusion of this paper, “Task diversity shortens the ICL plateau,” may appear somewhat trivial. The study demonstrates that training models with multiple tasks simultaneously involves using significantly more data (i.e., training with n tasks simultaneously implies utilizing n times the amount of training data). Though these data sets pertain to different in-context learning (ICL) tasks, the observed improvements in learning efficiency could stem from the increased data volume rath
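The batch-size concern raised by Reviewers 01 and 03 comes down to simple bookkeeping. The sketch below is not taken from the paper or its repository; it only counts training prompts consumed when a k-task run uses batch size kB and runs are compared at equal iteration counts, as the reviews describe. The function names and the values B = 64, k = 4, and 10,000 iterations are illustrative assumptions.

```python
def samples_seen(num_tasks, per_task_batch, iterations):
    """Count prompts consumed when each iteration draws B prompts per task."""
    total_batch = num_tasks * per_task_batch  # k * B, per the reviewers' description
    return {
        "total_batch": total_batch,
        "per_task": per_task_batch * iterations,
        "total": total_batch * iterations,
    }


def iterations_at_equal_total(num_tasks, single_task_iterations):
    """Iterations a k-task run would get if total prompts matched a single-task run."""
    return single_task_iterations // num_tasks


# Hypothetical settings, chosen only to make the arithmetic concrete.
single = samples_seen(num_tasks=1, per_task_batch=64, iterations=10_000)
multi = samples_seen(num_tasks=4, per_task_batch=64, iterations=10_000)

print("single-task:", single)
print("multi-task: ", multi)
print("total data ratio:", multi["total"] // single["total"])
print("matched-data iterations:", iterations_at_equal_total(4, 10_000))
```

At equal iteration counts the two runs see the same number of prompts per task, but the multi-task run consumes k times as many prompts overall. Whether the comparison looks fair therefore depends on which of these two quantities is held fixed, which is the point on which the reviews turn.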

Code & Models

Repositories

sehyunkwon/task-diversity-icl
PyTorch · Official

Taxonomy

Topics: Analytical Chemistry and Sensors · Advanced Memory and Neural Computing · Semiconductor materials and devices

Methods: Sparse Evolutionary Training