Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better

Ji Zhao; Yufei Gu; Shitong Shao; Xun Zhou; Liang Xiang; Zeke Xie

arXiv:2602.05393·cs.CL·February 6, 2026

Open Access · 3 Reviews

TL;DR

The paper introduces a Late-to-Early Training (LET) paradigm that leverages late-layer knowledge from pretrained models to accelerate and improve the training of larger language models, reducing computational costs and enhancing performance.

Contribution

It proposes a novel LET approach that guides early training stages using late-layer representations, significantly speeding up training and boosting model capabilities.

Findings

01. Achieves up to 1.6× training speedup on 1.4B models

02. Improves downstream task accuracy by nearly 5%

03. Effective with pretrained models much smaller than the target model

Abstract

As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: *Can we leverage existing small pretrained models to accelerate the training of larger models?* In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e., late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and…
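The core idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the cosine-based alignment loss, the linear decay schedule, the projection matrix `W`, and the names `let_weight`, `cosine_alignment_loss`, `project`, and `let_total_loss` are all hypothetical. The sketch only shows the general shape of the technique: an auxiliary term pulls a student's early-layer hidden state toward a teacher's late-layer hidden state, and that term is active only during early training steps.

```python
import math

def let_weight(step, guided_steps, w0=1.0):
    """Guidance weight (assumed linear decay): w0 at step 0, 0 from guided_steps onward."""
    if step >= guided_steps:
        return 0.0
    return w0 * (1.0 - step / guided_steps)

def cosine_alignment_loss(student_h, teacher_h):
    """Assumed alignment loss: 1 - cosine similarity of two equal-length vectors."""
    dot = sum(s * t for s, t in zip(student_h, teacher_h))
    norm_s = math.sqrt(sum(s * s for s in student_h))
    norm_t = math.sqrt(sum(t * t for t in teacher_h))
    return 1.0 - dot / (norm_s * norm_t + 1e-12)

def project(h, W):
    """Hypothetical linear map from the student width to the teacher width.

    W is a list of rows, one row per output (teacher) dimension.
    """
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def let_total_loss(lm_loss, student_early_h, teacher_late_h, W, step, guided_steps):
    """Language-modeling loss plus a decaying early-layer alignment term."""
    w = let_weight(step, guided_steps)
    if w == 0.0:
        # Past the guided phase the teacher is no longer needed.
        return lm_loss
    aligned = project(student_early_h, W)
    return lm_loss + w * cosine_alignment_loss(aligned, teacher_late_h)

# Toy usage: a 3-dim student state projected to a 2-dim teacher state.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
early_loss = let_total_loss(2.0, [1.0, 0.0, 5.0], [0.0, 1.0], W, step=0, guided_steps=100)
late_loss = let_total_loss(2.0, [1.0, 0.0, 5.0], [0.0, 1.0], W, step=100, guided_steps=100)
```

In this toy setup the projected student state is orthogonal to the teacher state, so the alignment term is at its midpoint value of 1 and is fully weighted at step 0; once `step` reaches `guided_steps` the total falls back to the plain language-modeling loss. Operating on hidden states through a projection, rather than on logits, is what would let the teacher and student differ in width and architecture under these assumptions.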

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

* The core mechanisms are clear (late-to-early-step / late-to-early-layer). It formalizes two mechanisms, using a teacher’s late-layer representations to guide a student’s early layers, and applying this guidance only in early training steps with a decaying schedule, yielding a reproducible training recipe.
* The approach is architecture-agnostic and effective with small teachers. Because alignment is performed on hidden states, the method imposes minimal architectural constraints and remains e

Weaknesses

* There is a dependence on teacher quality. Although LET works with small teachers, using weak or domain-mismatched teachers may inject harmful biases into the early layers, potentially leading to negative distillation effects. It would be better to also discuss the situations in which the proposed method does not work well.
* The breadth and strictness of baselines could be improved. While several baselines are covered, more stringent comparisons under identical token/compute/data budgets with

Reviewer 02 · Rating 4 · Confidence 4

Strengths

This paper demonstrates that knowledge distillation (KD) into a large model is possible using a teacher model that is 10x smaller. Because KD is applied only during the initial phase of pre-training rather than at every step, its computational cost does not persist throughout the entire training process. The paper presents results showing that this method achieves higher performance than standard training without knowledge distillation.

Weaknesses

- **Insufficient Baseline Comparisons:** The paper compares against **standard training** and **RKD**, but omits head-to-head evaluations with **large-teacher, logits-based KD**, strong **offline KD** pipelines, and recent **data-selection / model-growth** accelerators. Adding **wall-clock–normalized** and **peak-VRAM–normalized** comparisons to these families would more clearly position LET.
- **Size of the L2E Advantage:** While Figures 3–4 **suggest** L2E > L2M/L2L, the **visual gap

Reviewer 03 · Rating 8 · Confidence 3

Strengths

* The writing is very clear.
* The experiments are comprehensive, with well-designed ablation studies.
* The reported performance of the method is significant ("[the method] exceeds the baseline’s average performance while requiring less than 67% of the training steps even with 10× smaller model").
* The method does not require architectural compatibility between the student and the teacher.

Weaknesses

* While I do not see this as a significant weakness (given the detailed experimental evidence), the proposed method lacks theoretical backing.

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Domain Adaptation and Few-Shot Learning