Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate

A. Bochkov

arXiv:2507.07129·cs.LG·May 5, 2026

1 Repo 16 Models

TL;DR

This paper explores a modular, layer-wise expansion training regime for decoder-only Transformers with a fixed token interface, demonstrating continued learning with limited active parameters.

Contribution

It introduces a method for growing decoder-only Transformers layer by layer while keeping the token interface fixed and previously trained blocks frozen, so that fewer parameters are actively trained than in monolithic baselines.

Findings

01

Continued growth is viable even with minimal token interfaces.

02

A 269.7M parameter model trained on a fixed interface achieved 28.92% MMLU.

03

The approach offers a tradeoff between final perplexity and active parameter budget.

Abstract

We study a constrained training regime for decoder-only Transformers in which the token interface is fixed, previously trained dense blocks are not reopened, and the active trainable parameter set is kept approximately constant as depth grows. Starting from a shallow model, we stack new blocks and train only the newest blocks and the LM head; optional LoRA phases provide limited global readjustment under the same active-parameter budget. The paper asks a feasibility/tradeoff question, not whether this regime matches tuned monolithic pretraining. In a common-protocol 9-layer study on a frozen Unicode substrate, the constructive frozen-Unicode model uses 105.0M active trainable parameters, compared with 180.5M for the interface-matched monolithic frozen baseline and 247.6M for the fully trainable monolithic baseline. We then consider an extreme fixed interface: each token is represented…
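The growth regime described in the abstract (stack new blocks, freeze everything trained earlier, train only the newest blocks and the LM head) can be illustrated with a small scheduling sketch. This is not the author's implementation: the function name `growth_schedule`, the `Stage` record, the choice of three blocks per stage, and all parameter counts below are hypothetical placeholders chosen only to show why the active parameter set stays roughly constant as depth grows. Optional LoRA phases are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One growth stage: which block indices are frozen and which are trained."""
    frozen_blocks: List[int]
    trainable_blocks: List[int]
    active_params: int  # newest-block parameters plus LM head parameters


def growth_schedule(total_blocks: int, blocks_per_stage: int,
                    params_per_block: int, lm_head_params: int) -> List[Stage]:
    """Hypothetical layer-wise growth schedule on a frozen token interface.

    At each stage, up to `blocks_per_stage` new blocks are stacked on top.
    All earlier blocks are frozen; only the new blocks and the LM head train.
    The frozen embedding layer contributes no active parameters.
    """
    if total_blocks <= 0 or blocks_per_stage <= 0:
        raise ValueError("total_blocks and blocks_per_stage must be positive")

    stages: List[Stage] = []
    depth = 0  # number of blocks already trained and frozen
    while depth < total_blocks:
        new = min(blocks_per_stage, total_blocks - depth)
        stages.append(Stage(
            frozen_blocks=list(range(depth)),
            trainable_blocks=list(range(depth, depth + new)),
            active_params=new * params_per_block + lm_head_params,
        ))
        depth += new
    return stages


if __name__ == "__main__":
    # Assumed sizes for illustration only; not taken from the paper.
    for i, s in enumerate(growth_schedule(9, 3, 10_000_000, 5_000_000)):
        print(f"stage {i}: frozen={s.frozen_blocks} "
              f"train={s.trainable_blocks} active={s.active_params:,}")
```

Under these assumed sizes every stage trains the same number of parameters regardless of total depth, which is the property the abstract describes as keeping the active trainable set approximately constant.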

Code & Models

Repositories

avbochkov/PGT (GitHub)
