Deep Progressive Training: scaling up depth capacity of zero/one-layer models

Zhiqi Bu

arXiv:2511.04981·cs.LG·November 10, 2025

TL;DR

This paper studies progressive training, which expands model depth during training to cut compute. On GPT2, the proposed zero/one-layer variant saves about 80% of the compute while reaching almost the same loss.

Contribution

The paper proposes zero/one-layer progressive training as a practical way to scale up model depth efficiently, backed by theoretical insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion.

Findings

01. Achieves approximately 80% compute savings on GPT2.

02. Accelerates training by about 5 times with minimal loss degradation.

03. Provides theoretical insights into depth expansion and training dynamics.

Abstract

Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but require higher computational cost. To train models efficiently at scale, an effective strategy is progressive training, which scales up model capacity during training and hence significantly reduces computation with little to no performance degradation. In this work, we study the depth expansion of large models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion. Specifically, we propose zero/one-layer progressive training for the optimal tradeoff between computation and loss. For example, zero/one-layer progressive training on GPT2 can save $\approx 80\%$ compute, or equivalently accelerate $\approx 5 \times$ while achieving almost the same loss,…
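The equivalence between the two headline numbers in the abstract is plain arithmetic: if a run uses a fraction $1 - s$ of the baseline compute, the equivalent speedup is $1 / (1 - s)$. The sketch below illustrates only that conversion. The function names `speedup_from_savings` and `savings_from_speedup` are hypothetical helpers introduced here, not anything from the paper, and the sketch assumes that speedup is measured as the ratio of baseline compute to reduced compute.

```python
def speedup_from_savings(savings: float) -> float:
    """Convert a fractional compute saving into an equivalent speedup factor.

    Assumption (not from the paper): speedup = baseline compute / reduced compute.
    """
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - savings)


def savings_from_speedup(speedup: float) -> float:
    """Inverse conversion: a speedup factor back to a fractional compute saving."""
    if speedup < 1.0:
        raise ValueError("speedup must be at least 1")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # The abstract's figures: roughly 80% savings corresponds to roughly 5x.
    print(round(speedup_from_savings(0.80), 2))
    print(round(savings_from_speedup(5.0), 2))
```

Note that the relationship is nonlinear: under the same assumption, a 50% saving corresponds to only a 2x speedup, so the last few percentage points of savings account for most of the acceleration.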

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning