Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models
Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song

TL;DR
The paper introduces Bounded Hyperbolic Tanh (BHyT), a normalization method that enhances stability and efficiency in large language model training by controlling activation growth and reducing computational overhead.
Contribution
BHyT provides a stable, efficient alternative to Pre-LN by coupling tanh nonlinearity with data-driven input bounding and a lightweight variance approximation.
Findings
BHyT achieves 15.8% faster training on average.
It provides 4.2% higher token throughput.
It maintains or improves performance on language benchmarks.
Abstract
Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once…
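The idea of pairing tanh with data-driven input bounding can be illustrated with a minimal sketch. It is not the paper's formulation: the function `bounded_tanh`, the bound width `k`, and the per-call computation of exact statistics are assumptions made for illustration only. `dyt` follows the published Dynamic Tanh form, gamma * tanh(alpha * x) + beta. The sketch rescales the centered input by `k` standard deviations, so by Chebyshev's inequality at most 1/k² of the entries reach tanh with magnitude above 1.

```python
import math
import random


def dyt(x, alpha=1.0, gamma=1.0, beta=0.0):
    """Dynamic Tanh-style map: gamma * tanh(alpha * x) + beta, element-wise."""
    return [gamma * math.tanh(alpha * v) + beta for v in x]


def bounded_tanh(x, k=3.0, eps=1e-6):
    """Hypothetical bounded tanh (illustrative, not the paper's exact method).

    Centers x and divides by k * std before applying tanh. By Chebyshev's
    inequality, at most 1/k**2 of the entries lie more than k * std from
    the mean, so most tanh inputs fall inside [-1, 1].
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    scale = k * math.sqrt(var + eps)
    return [math.tanh((v - mean) / scale) for v in x]


def saturated_fraction(y, threshold=0.99):
    """Fraction of outputs whose magnitude exceeds the threshold."""
    return sum(1 for v in y if abs(v) > threshold) / len(y)


if __name__ == "__main__":
    random.seed(0)
    # A hidden state whose scale has grown large, as might happen at depth.
    hidden = [random.gauss(0.0, 50.0) for _ in range(1024)]
    print("saturated, fixed-scale tanh:", saturated_fraction(dyt(hidden)))
    print("saturated, bounded tanh:   ", saturated_fraction(bounded_tanh(hidden)))
```

With a fixed input scale, most entries of a large-magnitude hidden state land in the flat region of tanh, whereas the rescaled version keeps them in the range where tanh still responds to its input.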
Peer Reviews
Decision · Submitted to ICLR 2026
- Clear motivation.
- Conceptually simple yet theoretically grounded design.
- Lightweight variance approximation for efficiency.
- Empirical evidence of improved stability.
- **Inadequate reporting and questionable generality of Peri-LN throughput results.** Figure 4(b) claims that Peri-LN achieves strong accuracy but suffers from the slowest throughput, positioning BHyT as the best trade-off. However, the paper does not specify the environment under which throughput was measured. All experiments were conducted in Llama-Factory rather than in standard large-scale frameworks (e.g., Megatron-Core or NeMo) that officially support Gemma-style Peri-LN. Without such meta
- Combines the speed advantage of "unnormalized" (normalization-free) layers with the stability of a "bounded, non-saturated" activation, explicitly limiting the probability of tanh saturation.
- The method is simple and effective.
- Validated only on 1B and 3B models, C4, and a small number of SFT datasets; deeper models and longer contexts are not covered. I understand that this is not realistic with limited resources, but perhaps training a narrow, deep model would be feasible?
- In which training stages or tasks is the "uniform" attention assumption least tenable? If the second moment of the real attention weights replaced the uniform assumption, what would the approximation error and throughput loss be?
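One way to read the question about the uniform attention assumption is through a standard variance identity. It is used here as an illustration and is not taken from the paper: if the value entries are independent with a common variance, a weighted sum with attention weights a_i has variance equal to that common variance times the sum of a_i². Uniform weights over n tokens give 1/n, and peaked weights give more. The function name and the example weights below are hypothetical.

```python
def attn_output_variance(weights, value_var=1.0):
    """Variance of sum_i a_i * v_i for independent v_i with variance value_var."""
    return value_var * sum(a * a for a in weights)


n = 8
uniform = [1.0 / n] * n              # every token weighted equally
peaked = [0.65] + [0.05] * (n - 1)   # one dominant token; weights sum to 1

if __name__ == "__main__":
    print("uniform:", attn_output_variance(uniform))
    print("peaked: ", attn_output_variance(peaked))
```

Under these assumptions, the gap between the two values is the kind of approximation error the reviewer is asking about, and computing the sum of squared weights per row is the extra work that would affect throughput.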
1. Creative combination of normalization-free activations with explicit, data-aware input bounding and a block-level variance approximation; the latter gives a neat knob to keep overhead small while preserving stability.
2. Clear derivations, concise definitions, and readable pseudocode.
3. If the approach scales, it could reduce normalization overhead in large LMs without sacrificing training stability.
Please see the questions.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
