Curriculum-Guided Layer Scaling for Language Model Pretraining
Karanpartap Singh, Neil Band, Ehsan Adeli

TL;DR
This paper introduces Curriculum-Guided Layer Scaling (CGLS), a training strategy that progressively increases model depth and data difficulty during language model pretraining, leading to improved efficiency and performance.
Contribution
CGLS is a novel framework that synchronizes model growth with data complexity, enhancing pretraining efficiency and generalization in large language models.
Findings
CGLS outperforms baseline methods on QA benchmarks at 100M parameters.
Progressive layer stacking improves zero-shot performance at 1.2B scale.
Increasing model depth in step with data difficulty improves generalization.
Abstract
As the cost of pretraining large language models grows, there is continued interest in strategies to improve learning efficiency during this core training stage. Motivated by cognitive development, where humans gradually build knowledge as their brains mature, we propose Curriculum-Guided Layer Scaling (CGLS), a framework for compute-efficient pretraining that synchronizes increasing data difficulty with model growth through progressive layer stacking (i.e. gradually adding layers during training). At the 100M parameter scale, using a curriculum transitioning from synthetic short stories to general web data, CGLS outperforms baseline methods on the question-answering benchmarks PIQA and ARC. Pretraining at the 1.2B scale, we stratify the DataComp-LM corpus with a DistilBERT-based classifier and progress from general text to highly technical or specialized content. Our results show that…
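The abstract's core idea, growing depth in stages while shifting the data mix toward harder text, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the stage table `STAGES`, the step boundaries, the copy-the-top-layers initialization in `grow_by_stacking`, and the mean-word-length scorer (a toy stand-in for the DistilBERT-based classifier) are all hypothetical choices made for the example.

```python
import copy
import random


def mean_word_length(doc):
    """Toy difficulty score (assumption): a stand-in for a learned classifier."""
    words = doc.split()
    return sum(len(w) for w in words) / len(words)


def stratify(docs, score_fn, threshold):
    """Split documents into 'easy' and 'hard' buckets by a difficulty score."""
    buckets = {"easy": [], "hard": []}
    for doc in docs:
        buckets["hard" if score_fn(doc) >= threshold else "easy"].append(doc)
    return buckets


def grow_by_stacking(layers, num_new):
    """Return a deeper stack by appending copies of the top `num_new` layers.

    Copy-initialization is an assumption of this sketch; the source does not
    state how new layers are initialized.
    """
    if not 0 < num_new <= len(layers):
        raise ValueError("num_new must be between 1 and the current depth")
    return layers + [copy.deepcopy(layer) for layer in layers[-num_new:]]


def stage_for_step(step, boundaries):
    """Index of the stage containing `step`; `boundaries` are stage start steps."""
    return sum(step >= b for b in boundaries)


def sample_batch(buckets, mix, batch_size, rng):
    """Draw a batch whose bucket proportions follow the stage's `mix` weights."""
    names = list(mix)
    chosen = rng.choices(names, weights=[mix[n] for n in names], k=batch_size)
    return [rng.choice(buckets[name]) for name in chosen]


def run_schedule(stages, boundaries, total_steps, buckets, batch_size=4, seed=0):
    """Walk the schedule, growing the layer stack at each stage boundary."""
    rng = random.Random(seed)
    # Each "layer" is a placeholder dict; a real model would hold weights here.
    layers = [{"w": 0.0} for _ in range(stages[0]["num_layers"])]
    current, history = 0, []
    for step in range(total_steps):
        stage = stage_for_step(step, boundaries)
        if stage != current:
            layers = grow_by_stacking(layers, stages[stage]["num_layers"] - len(layers))
            current = stage
        batch = sample_batch(buckets, stages[stage]["mix"], batch_size, rng)
        # A real training step on `batch` would go here.
        history.append((step, len(layers), batch))
    return layers, history


# Hypothetical toy corpus and schedule, chosen only for illustration.
DOCS = [
    "the cat sat on the mat",
    "a dog ran to me",
    "stochastic gradient descent minimizes empirical risk",
    "eigenvalue decomposition diagonalizes symmetric matrices",
]
BUCKETS = stratify(DOCS, mean_word_length, threshold=5.0)
STAGES = [
    {"num_layers": 2, "mix": {"easy": 1.0, "hard": 0.0}},
    {"num_layers": 4, "mix": {"easy": 0.5, "hard": 0.5}},
    {"num_layers": 6, "mix": {"easy": 0.1, "hard": 0.9}},
]
```

Calling `run_schedule(STAGES, [3, 6], 9, BUCKETS)` grows the stack from 2 to 4 to 6 layers at steps 3 and 6, and the first stage draws only from the easy bucket. Keeping the depth and the data mix in one stage table is a design choice of this sketch: it makes the synchronization between model growth and data difficulty explicit in a single place.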
Peer Reviews
Decision · Submitted to ICLR 2026
1. The integration of curriculum learning and progressive layer scaling into a unified pretraining framework is novel for LLM pretraining. The approach is conceptually sound and aligns well with cognitive principles of gradual learning. 2. The experiments are thorough and well-validated across multiple model scales, with replication and ablation studies supporting the claims. Results show consistent and meaningful gains on reasoning-focused benchmarks such as PIQA and ARC, demonstrating improve
1. Most experiments focus on reasoning and question-answering benchmarks. Broader capabilities such as dialogue generation, summarization, or open-ended tasks are not evaluated, leaving the generality of the approach less explored. 2. The effectiveness of CGLS relies on the accuracy of the DistilBERT-based difficulty classifier for data stratification, which may not generalize well to other languages, domains, or modalities. 3. The method is only tested on text-based pretraining; no experiments
1. The core design of the paper is quite interesting and can provide some inspiration for the field. 2. The paper is clearly written and easy to understand.
1. The downstream evaluation focuses heavily on multiple-choice QA tasks (PIQA, ARC). While these are standard, they don't fully capture the breadth of LLM capabilities. The evaluation would be strengthened by including math benchmarks (e.g., GSM8K, AIME 2024, AIME 2025…) and coding benchmarks (e.g., HumanEval…). 2. The experiments only reach 1B parameters and are confined to the Llama architecture, failing to demonstrate scalability to modern 7B+ models or generalization across diverse archi
1. The research direction of progressively increasing the model size is interesting and worth exploring.
1. Overall, the novelty and insight provided by this paper are limited. Various settings used in the experiments seem arbitrary, which raises concerns about scalability. 2. The comparison with, and the improvement over, the baseline do not seem very solid. The setup for the curriculum learning baseline is not clearly specified, while there are various stronger baselines in this category in the literature. For the 700M and 2.5B token setting, the improvement over the curriculum learning baseline is slight
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Natural Language Processing Techniques
