Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

Sergey Alekseev

arXiv:2604.11890·cs.LG·May 8, 2026

TL;DR

This paper analyzes signal propagation at initialization in normalization-free transformers using the averaged partial Jacobian norm (APJN), extends the analysis to transformers with bidirectional attention, and explains the initialization sensitivity of the DyT and Derf architectures.

Contribution

It extends APJN analysis to transformers with bidirectional attention and permutation-symmetric input token configurations, and predicts how attention changes the large-depth behavior of the APJN.

Findings

01

Pre-LayerNorm transformers exhibit power-law APJN growth.

02

Replacing LayerNorm with elementwise tanh-like nonlinearities leads to stretched-exponential APJN growth, indicating that these transformers are subcritical.

03

The theory explains the sensitivity of DyT and Derf architectures to initialization.

Abstract

We study signal propagation at initialization in transformers through the averaged partial Jacobian norm (APJN), a measure of gradient amplification across layers. We extend APJN analysis to transformers with bidirectional attention and permutation-symmetric input token configurations by deriving recurrence relations for activation statistics and APJNs across layers. Our theory predicts how attention modifies the asymptotic behavior of the APJN at large depth and matches APJNs measured in deep vision transformers. The criticality picture known from residual networks carries over to transformers: the pre-LayerNorm architecture exhibits power-law APJN growth, whereas transformers with LayerNorm replaced by elementwise $\tanh$-like nonlinearities have stretched-exponential APJN growth, indicating that the latter are subcritical. Applied to Dynamic Tanh (DyT) and Dynamic erf (Derf)…
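To make the APJN concrete, here is a minimal numerical sketch. It is not the paper's transformer setup. It assumes the usual partial-Jacobian definition, APJN(0 → l) = E[(1/N) ‖∂h_l/∂h_0‖_F²], and a toy residual block h_{l+1} = h_l + W_l φ(h_l) with Gaussian weights. The function name `apjn_curve` and the parameters `sigma_w`, `n_samples`, and `seed` are hypothetical, chosen only for illustration.

```python
import numpy as np

def apjn_curve(depth, width, phi, dphi, sigma_w=1.0, n_samples=8, seed=0):
    """Monte Carlo estimate of APJN(0 -> l) for l = 0..depth.

    Toy block (an assumption, not the paper's architecture):
        h_{l+1} = h_l + W_l @ phi(h_l),  W_l ~ N(0, sigma_w^2 / width)
    """
    rng = np.random.default_rng(seed)
    curve = np.zeros(depth + 1)
    for _ in range(n_samples):
        h = rng.standard_normal(width)   # random input activations
        J = np.eye(width)                # d h_0 / d h_0 is the identity
        curve[0] += np.sum(J**2) / width
        for l in range(depth):
            W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
            # Layer Jacobian is I + W diag(phi'(h_l)); W * dphi(h) scales
            # column j of W by phi'(h_l)[j], which equals W @ diag(phi'(h_l)).
            J = J + (W * dphi(h)) @ J    # chain rule: new J = (I + W D) @ J
            h = h + W @ phi(h)           # forward pass through the block
            curve[l + 1] += np.sum(J**2) / width
    return curve / n_samples             # average over weight/input draws

# Elementwise tanh in place of a normalization layer.
tanh_curve = apjn_curve(
    depth=32, width=64,
    phi=np.tanh, dphi=lambda x: 1.0 - np.tanh(x) ** 2,
)
```

Plotting `tanh_curve` against depth on log-log and log-linear axes is one way to inspect how fast the toy model's APJN grows. Whether it reproduces the growth laws stated in the abstract depends on the architecture and initialization, which this sketch does not match.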
