TL;DR
This paper explores a modular, layer-wise expansion training regime for decoder-only Transformers with a fixed token interface, demonstrating continued learning with limited active parameters.
Contribution
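It presents a method for growing decoder-only Transformers layer by layer while keeping the token interface fixed and previously trained blocks frozen. In the 9-layer study, the constructive model uses 105.0M active trainable parameters, versus 180.5M for the interface-matched monolithic frozen baseline and 247.6M for the fully trainable monolithic baseline.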
Findings
Continued layer-wise growth remains viable even when the fixed token interface is reduced to a minimal representation.
A 269.7M parameter model trained on a fixed interface achieved 28.92% MMLU.
The approach offers a tradeoff between final perplexity and active parameter budget.
Abstract
We study a constrained training regime for decoder-only Transformers in which the token interface is fixed, previously trained dense blocks are not reopened, and the active trainable parameter set is kept approximately constant as depth grows. Starting from a shallow model, we stack new blocks and train only the newest blocks and the LM head; optional LoRA phases provide limited global readjustment under the same active-parameter budget. The paper asks a feasibility/tradeoff question, not whether this regime matches tuned monolithic pretraining. In a common-protocol 9-layer study on a frozen Unicode substrate, the constructive frozen-Unicode model uses 105.0M active trainable parameters, compared with 180.5M for the interface-matched monolithic frozen baseline and 247.6M for the fully trainable monolithic baseline. We then consider an extreme fixed interface: each token is represented…
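The stacking schedule described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the stage size of 3 blocks, the `d_model` and `vocab_size` values, the `12 * d_model**2` per-block parameter estimate, and all function names are assumptions chosen for illustration, and the optional LoRA phases are omitted. The only detail taken from the abstract is the rule that each stage trains just the newest blocks plus the LM head, with the 9-layer depth used as the example target.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    depth: int        # total number of blocks stacked after this stage
    frozen: list      # indices of earlier blocks, never reopened
    trainable: list   # indices of the newly stacked blocks


def growth_schedule(final_depth, blocks_per_stage):
    """Return one Stage per growth step until final_depth blocks exist."""
    stages = []
    depth = 0
    while depth < final_depth:
        new_depth = min(depth + blocks_per_stage, final_depth)
        stages.append(Stage(
            depth=new_depth,
            frozen=list(range(depth)),
            trainable=list(range(depth, new_depth)),
        ))
        depth = new_depth
    return stages


def block_params(d_model):
    # Rough estimate (assumption): 4*d^2 for attention projections plus
    # 8*d^2 for a 4x-wide MLP, ignoring biases and normalization layers.
    return 12 * d_model * d_model


def active_params(stage, d_model, vocab_size):
    # Trainable set = newest blocks + LM head. The token interface
    # (embeddings) and all older blocks stay frozen.
    return len(stage.trainable) * block_params(d_model) + d_model * vocab_size


def total_block_params(stage, d_model):
    # Parameters held in all stacked blocks, frozen or not.
    return stage.depth * block_params(d_model)


if __name__ == "__main__":
    D_MODEL, VOCAB = 512, 32000  # hypothetical sizes, not from the paper
    for s in growth_schedule(final_depth=9, blocks_per_stage=3):
        print(s.depth, s.frozen, s.trainable,
              active_params(s, D_MODEL, VOCAB),
              total_block_params(s, D_MODEL))
```

Under these assumptions the active parameter count is identical at every stage while the total held in the stacked blocks grows with depth, which is the property the abstract calls keeping the active trainable set approximately constant.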
Code & Models
- 🤗 Bochkov/bvv241-2-3 (model, 1 download)
- 🤗 Bochkov/bvv241-max (model, 2 downloads)
- 🤗 Bochkov/bvv241-nemo (model, 1 download)
- 🤗 Bochkov/bvv241-abs (model, 1 download)
- 🤗 Bochkov/emergent-semantics-model-uni-glyph-335m (model, 4 downloads)
- 🤗 Bochkov/emergent-semantics-model-unfrozen-335m (model, 2 downloads)
- 🤗 Bochkov/emergent-semantics-model-1024-bit-335m (model, 4 downloads)
- 🤗 Bochkov/emergent-semantics-model-256-bit-285m (model, 1 download)
- 🤗 Bochkov/emergent-semantics-model-64-bit-272m (model, 1 download)
- 🤗 Bochkov/emergent-semantics-model-16-bit-269m (model, 5 downloads, 1 like)
