Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

Pengxiang Li; Lu Yin; Shiwei Liu

arXiv:2412.13795·cs.LG·August 5, 2025

TL;DR

Mix-LN is a novel normalization method that combines Pre-LN and Post-LN to improve gradient flow and training effectiveness in deep language models, leading to better performance and utilization of all layers.

Contribution

We introduce Mix-LN, a hybrid normalization technique that enhances gradient norms across layers, addressing training shortfalls in deep LLMs caused by traditional normalization methods.

Findings

01 Mix-LN outperforms Pre-LN and Post-LN across various model sizes.

02 Models trained with Mix-LN learn better during fine-tuning and RLHF.

03 Mix-LN promotes more balanced and healthier gradients throughout the network.

Abstract

Large Language Models (LLMs) have achieved remarkable success, yet recent findings reveal that their deeper layers often contribute minimally and can be pruned without affecting overall performance. While some view this as an opportunity for model compression, we identify it as a training shortfall rooted in the widespread use of Pre-Layer Normalization (Pre-LN). We demonstrate that Pre-LN, commonly employed in models like GPT and LLaMA, leads to diminished gradient norms in its deeper layers, reducing their effectiveness. In contrast, Post-Layer Normalization (Post-LN) preserves larger gradient norms in deeper layers but suffers from vanishing gradients in earlier layers. To address this, we introduce Mix-LN, a novel normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper…
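
The layer assignment described in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes a ratio `alpha` (the symbol the reviewers use) that sets the fraction of layers given Post-LN, and it assumes floor rounding for the boundary; the truncated abstract states neither, so treat both as illustrative. The function names are hypothetical and are not taken from the official repository. The two block forms are the standard definitions: Pre-LN computes x + F(LN(x)), and Post-LN computes LN(x + F(x)).

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_block(x: Vector, sublayer: Sublayer) -> Vector:
    """Pre-LN residual block: x + F(LN(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_ln_block(x: Vector, sublayer: Sublayer) -> Vector:
    """Post-LN residual block: LN(x + F(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def mix_ln_schedule(num_layers: int, alpha: float) -> List[str]:
    """Assign Post-LN to the earliest layers and Pre-LN to the rest.

    Assumption: the boundary is floor(alpha * num_layers).
    """
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    num_post = math.floor(alpha * num_layers)
    return ["post_ln"] * num_post + ["pre_ln"] * (num_layers - num_post)


def mix_ln_forward(x: Vector, sublayers: List[Sublayer], alpha: float) -> Vector:
    """Run a stack of sublayers, choosing the block type per layer from the schedule."""
    schedule = mix_ln_schedule(len(sublayers), alpha)
    for kind, sublayer in zip(schedule, sublayers):
        block = post_ln_block if kind == "post_ln" else pre_ln_block
        x = block(x, sublayer)
    return x
```

With `alpha = 0` the stack reduces to pure Pre-LN and with `alpha = 1` to pure Post-LN, so the ratio interpolates between the two baselines the paper compares against.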

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The observation about the gradient norms induced by LN is interesting, clear, and straight to the point.
- Part of the quantitative evaluation is conducted on large-scale LLMs, which makes the work timely.
- All the empirical evaluation is quite realistic and likely correct.
- The figures generally do an excellent job of displaying the distributions.

Weaknesses

- The relative gains in perplexity/accuracy shrink for larger models, which in a certain sense contradicts the vanishing-gradient argument. Besides, the gap relative to Pre-LN keeps narrowing.
- The success of the approach depends on $\alpha$: there are cases, like LLaMA-1B on BoolQ, in which Pre-LN alone performs better, indicating that tuning $\alpha$ properly can be decisive.
- Metrics are not always complete: I could not find, for example, performance drop when using Mi

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- Novel Approach to Gradient Flow in LLMs: Mix-LN’s hybrid approach to layer normalization is unique, attempting to combine the advantages of Pre-LN and Post-LN across different model depths.
- Good Experimental Validation: The paper includes extensive experimentation on multiple model sizes and tasks, showing consistent improvement with Mix-LN, particularly in mid-sized models.
- Improved Training Stability: Mix-LN appears to mitigate some of the training instability issues commonly observed

Weaknesses

- Lack of Theoretical Rigor: The paper does not provide a detailed theoretical framework to explain why Mix-LN achieves balanced gradient dynamics across layers, which is crucial for the paper’s validity. In the well-explored field of normalization for LLMs, incremental changes require more substantial theoretical backing to make a notable contribution.
- Limited Comparison with State-of-the-Art Normalization Techniques: The paper lacks direct comparison with recent normalization methods, s

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. This paper is very well organized.
2. The paper introduces the problem with theoretical analysis and reinforces it with experiments, which makes the problem clear and well motivated.
3. Comprehensive experiments demonstrate the effect of Mix-LN.
4. The analyses are insightful.

Weaknesses

1. The figures of angular distance are somewhat non-intuitive and hard to understand at first sight. "Block size" reads more like the size of the model than the distance between the two layers under comparison. The first detailed description is actually in lines 402-403.
2. Typo: Figure 3-c (in the main body) or Figure 3-e (in the caption of Figure 3).

Code & Models

Repositories

pixeli99/mixln
PyTorch · Official

Taxonomy

Topics: Neural Networks and Applications · Neural Networks and Reservoir Computing

Methods: Attention Is All You Need · Residual Connection · Linear Layer · Weight Decay · Cosine Annealing · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Softmax · Attention Dropout