Post-LayerNorm Is Back: Stable, ExpressivE, and Deep

Chen Chen; Lai Wei

arXiv:2601.19895·cs.LG·February 2, 2026

Open Access

TL;DR

This paper shows that Keel, a Post-LayerNorm Transformer that replaces the ResNet-style residual path with a Highway-style connection, trains stably at extreme depths and surpasses Pre-LN in depth scalability and performance.

Contribution

The authors introduce Keel, a Post-LayerNorm Transformer with Highway-style connections that trains stably at depths of more than 1000 layers without complex tricks.

Findings

01 Keel trains reliably at depths over 1000 layers.

02 Keel outperforms Pre-LN in perplexity and depth-scaling.

03 Post-LN with Highway connections enables extremely deep LLMs.

Abstract

Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing signal vanishing from the top layers to the bottom. Unlike prior methods, Keel enables stable training…
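
To make the abstract's contrast concrete, here is a minimal sketch of three block wirings: classic Post-LN, Pre-LN, and a Post-LN block whose residual sum is swapped for a generic Highway-style mix. The abstract does not give Keel's exact formulation, so everything here is an assumption for illustration: the `sublayer` function is a hypothetical stand-in for attention or a feed-forward layer, the gate is a fixed scalar `sigmoid(gate_bias)` rather than a learned, input-dependent gate, and the layer normalization has no learned scale or shift.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x, weight=0.5):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [math.tanh(weight * v) for v in x]

def post_ln_block(x):
    """Classic Post-LN: normalize after the residual sum, LN(x + F(x))."""
    fx = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, fx)])

def pre_ln_block(x):
    """Pre-LN: normalize the sublayer input only, x + F(LN(x))."""
    fx = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, fx)]

def highway_post_ln_block(x, gate_bias=-1.0):
    """Post-LN over a generic Highway-style mix: LN(t * F(x) + (1 - t) * x).

    The scalar gate t = sigmoid(gate_bias) is an illustrative assumption;
    the connection used in Keel may differ.
    """
    t = 1.0 / (1.0 + math.exp(-gate_bias))
    fx = sublayer(x)
    return layer_norm([t * b + (1.0 - t) * a for a, b in zip(x, fx)])

x = [0.5, -1.0, 2.0, 0.25]
for block in (post_ln_block, pre_ln_block, highway_post_ln_block):
    print(block.__name__, [round(v, 3) for v in block(x)])
```

In this sketch the gate controls how much of the input is carried through unchanged: as `gate_bias` becomes very negative, `t` approaches 0 and the Highway-style block reduces to normalizing its input. Both Post-LN variants return normalized activations, whereas the Pre-LN block leaves the residual stream unnormalized.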

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Topic Modeling