Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

Hoyoon Byun; Youngjun Choi; Taero Kim; Sungrae Park; Kyungwoo Song

arXiv:2601.09719·cs.CL·February 4, 2026

Open Access · 3 Reviews

TL;DR

The paper introduces Bounded Hyperbolic Tanh (BHyT), a normalization method that enhances stability and efficiency in large language model training by controlling activation growth and reducing computational overhead.

Contribution

BHyT provides a stable, efficient alternative to Pre-LN by coupling tanh nonlinearity with data-driven input bounding and a lightweight variance approximation.

Findings

01. BHyT achieves 15.8% faster training on average.
02. It provides 4.2% higher token throughput.
03. It maintains or improves performance on language benchmarks.

Abstract

Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once…
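
The bounding idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the function name `bounded_tanh`, the parameters `k`, `c`, and `eps`, and the Chebyshev-style bound `|mean| + k * std` are all assumptions chosen for illustration. The sketch also recomputes exact statistics on every call and makes no attempt to model the paper's variance approximation.

```python
import math


def bounded_tanh(x, k=3.0, c=2.0, eps=1e-6):
    """Hypothetical bounded-tanh sketch (illustrative, not the paper's method).

    x   : list of floats, one hidden-state vector
    k   : number of standard deviations covered by the bound (assumed)
    c   : half-width of the tanh input range treated as non-saturating (assumed)
    eps : small constant that avoids division by zero
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    # Data-driven bound: by Chebyshev's inequality, at most a 1/k^2 fraction
    # of the entries can lie more than k standard deviations from the mean.
    bound = abs(mean) + k * math.sqrt(var) + eps
    # Rescale so that entries within the bound map into [-c, c] before tanh.
    scale = c / bound
    return [math.tanh(v * scale) for v in x]


small = bounded_tanh([0.5, -1.0, 2.0, -0.25])
large = bounded_tanh([50.0, -100.0, 200.0, -25.0])  # same vector, 100x larger
```

Because the scale is derived from the input's own statistics, `small` and `large` are nearly identical: growth in input magnitude does not push the tanh toward saturation in this sketch.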

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Clear motivation.
- Conceptually simple yet theoretically grounded design.
- Lightweight variance approximation for efficiency.
- Empirical evidence of improved stability.

Weaknesses

- **Inadequate reporting and questionable generality of Peri-LN throughput results.** Figure 4(b) claims that Peri-LN achieves strong accuracy but suffers from the slowest throughput, positioning BHyT as the best trade-off. However, the paper does not specify the environment under which throughput was measured. All experiments were conducted in Llama-Factory rather than in standard large-scale frameworks (e.g., Megatron-Core or NeMo) that officially support Gemma-style Peri-LN. Without such meta

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Combines the speed advantage of "unnormalized" (normalization-free) methods with the stability of a bounded, non-saturating activation, explicitly avoiding tanh saturation through probabilistic bounding of the input.
- The method is simple and effective.

Weaknesses

- Validated only on 1B/3B models, C4, and a small number of SFT datasets; deeper models and longer contexts are not covered. I understand that this is unrealistic with limited resources, but perhaps training a narrow and deep model would be feasible?
- In which training stages or tasks is the "uniform" attention assumption least tenable? If the second moment of the real attention weights replaced the uniform assumption, what would the approximation error and throughput loss be?

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Creative combination of normalization-free activations with explicit, data-aware input bounding and a block-level variance approximation; the latter gives a neat knob to keep overhead small while preserving stability.
2. Clear derivations, concise definitions, readable pseudocode.
3. If the approach scales, it could reduce normalization overhead in large LMs without sacrificing training stability.

Weaknesses

Please see the questions.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods