Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws

Hidetaka Kamigaito; Ying Zhang; Jingun Kwon; Katsuhiko Hayashi; Manabu Okumura; Taro Watanabe

arXiv:2505.24009·cs.CL·June 10, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates how the diversity of layers within Transformer models affects their performance, revealing that diversity and the number of layers play crucial roles in parameter scaling laws and task accuracy.

Contribution

The study introduces a bias-diversity decomposition for Transformer layers and demonstrates the importance of layer diversity for performance improvements, supported by theoretical analysis and empirical validation.

Findings

01 Layer diversity is critical for Transformer performance.

02 Adding layers improves performance only when layers are diverse.

03 Performance gains diminish with more layers, following a logarithmic pattern.
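As a rough illustration of what a logarithmic pattern implies for added layers, the sketch below computes the marginal gain of going from n to n + 1 layers under a curve of the form a + b·log(n). The function names and the coefficients `a` and `b` are hypothetical placeholders chosen for illustration; they are not fitted values or definitions from the paper.

```python
import math

def log_curve(num_layers, a=0.0, b=1.0):
    # Hypothetical performance curve a + b * log(num_layers);
    # a and b are illustrative placeholders, not values from the paper.
    return a + b * math.log(num_layers)

def marginal_gain(num_layers, a=0.0, b=1.0):
    # Gain from adding one more layer to a model with num_layers layers.
    return log_curve(num_layers + 1, a, b) - log_curve(num_layers, a, b)

# Each added layer contributes less than the previous one.
gains = [marginal_gain(n) for n in range(1, 33)]
assert all(earlier > later for earlier, later in zip(gains, gains[1:]))
```

Under this assumed form, the gain from one extra layer equals b·log(1 + 1/n), which stays positive but shrinks toward zero as n grows.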

Abstract

Transformers deliver outstanding performance across a wide range of tasks and are now a dominant backbone architecture for large language models (LLMs). Their task-solving performance is improved by increasing parameter size, as shown in recent studies on parameter scaling laws. Although recent mechanistic-interpretability studies have deepened our understanding of the internal behavior of Transformers by analyzing their residual stream, the relationship between these internal mechanisms and the parameter scaling laws remains unclear. To bridge this gap, we focus on layers and their size, which largely determine the parameter size of Transformers. For this purpose, we first theoretically investigate the layers within the residual stream through a bias-diversity decomposition. The decomposition separates (i) bias, the error of each layer's output from the ground truth, and (ii)…
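For readers unfamiliar with bias-diversity decompositions, the sketch below shows the classic ambiguity decomposition from ensemble learning for squared error: the error of the averaged prediction equals the mean individual error minus the mean squared deviation from the average. This is an assumption-laden stand-in, not the paper's own theorem: it treats each layer's output as a scalar ensemble member and averages them, whereas the paper's formulation for the residual stream may differ. The function and variable names are hypothetical.

```python
import random

def ambiguity_decomposition(outputs, target):
    """Classic ensemble ambiguity decomposition for scalar outputs.

    Returns (ensemble_error, mean_bias, mean_diversity), where
    ensemble_error == mean_bias - mean_diversity up to rounding.
    Illustrative only; not the paper's exact formulation.
    """
    n = len(outputs)
    mean_out = sum(outputs) / n
    # Squared error of the averaged prediction.
    ensemble_error = (mean_out - target) ** 2
    # Average squared error of each member against the ground truth.
    mean_bias = sum((o - target) ** 2 for o in outputs) / n
    # Average squared spread of the members around their own mean.
    mean_diversity = sum((o - mean_out) ** 2 for o in outputs) / n
    return ensemble_error, mean_bias, mean_diversity

random.seed(0)
target = 1.0
# Hypothetical per-layer scalar outputs scattered around the target.
layer_outputs = [target + random.gauss(0.0, 0.5) for _ in range(8)]
err, bias, div = ambiguity_decomposition(layer_outputs, target)
assert abs(err - (bias - div)) < 1e-12
```

In this simplified setting, identical members give zero diversity, so averaging them cannot reduce the error below that of a single member. That is the intuition behind the finding that added layers help only when they are diverse.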

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. Introduces a bias–diversity decomposition framework to analyze Transformer layers, offering a novel theoretical lens linking internal mechanisms to scaling laws.
2. Bridges mechanistic interpretability and empirical scaling-law studies, a connection rarely explored.
3. Strong theoretical grounding with multiple theorems supported by rigorous proofs.
4. Empirical validation across diverse LLM families and NLP benchmarks strengthens claims.
5. The paper is clearly structured, with smooth transi

Weaknesses

1. While the theory links diversity to performance, concrete architectural or training recommendations (e.g., diversity regularization or pruning criteria) are missing. Demonstrating such interventions could make the work more actionable.
2. Experiments focus only on NLP benchmarks. Testing on multimodal or vision Transformers could demonstrate broader applicability and confirm whether the diversity–scaling relationship generalizes beyond text models.
3. Although the work aims to connect interpr

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The paper introduces an original bias–diversity decomposition framework to connect internal Transformer dynamics with scaling laws, offering theoretical insight. The work presents some theorems to formalize key relationships: (1) the decomposition of prediction error into bias and diversity; (2) the trade-off between bias and diversity; (3) the submodularity of performance gains from adding layers. Comprehensive experiments across multiple LLMs (e.g., LLaMA, Phi, Mistral) and NLP benchma

Weaknesses

1. Theorems 5 and 6 indicate that increasing the number of model layers does not necessarily improve performance. This paper does not provide relevant experimental examples to illustrate this phenomenon. Both Conditional Redundancy and Redundancy increase monotonically, but their difference may still increase or decrease monotonically.
2. Theorem 7 assumes that when U_i are independent, the improvement of model performance is submodular. This paper doesn’t explain the rationality of the indepen

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. A key strength of this work is its novel theoretical framework for bias-diversity decomposition, which it tailors specifically to the Transformer's residual stream. While grounded in ensemble learning, the formulation (Theorem 1, Eq. 8) provides an original and principled lens to quantify the contributions of individual layers and modules (e.g., attention, MLP) in granular, interpretable terms.
2. A particularly deep theoretical contribution is the reframing of diversity via information-theor

Weaknesses

1. The notation used throughout the paper is inconsistent and poorly standardized. Many symbols are not clearly defined upon their first appearance. The authors should provide a unified and clear definition for all symbols used in the theorems and mathematical formulae.
2. The results in Figures 3 and 5 do not strongly support the authors' viewpoint. In Figure 3, there is no clear correlation shown between bias, diversity, and accuracy, while the strong correlation between MSE and accuracy is no
