Leaner Transformers: More Heads, Less Depth
Hemanth Saratchandran, Damien Teney, Simon Lucey

TL;DR
This paper challenges the notion that larger transformers are always better by demonstrating that increasing the number of attention heads while reducing depth can maintain or improve performance and shrink the model across multiple tasks.
Contribution
It introduces a theoretical principle that highlights the benefits of more attention heads for better conditioning, enabling shallower, more efficient transformer architectures.
Findings
Increased heads improve attention conditioning.
Shallower models with more heads match or outperform deeper models.
Parameter count reduced by up to 30-50% while maintaining accuracy.
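The arithmetic behind the last finding can be sketched with a rough parameter count. This is a minimal sketch, assuming the standard multi-head formulation in which heads split a fixed model width, so that adding heads leaves the parameter count unchanged and only depth moves it. The function names, the width of 768, and the 12-layer/12-head versus 8-layer/16-head configurations are illustrative assumptions, not settings taken from the paper; embeddings and task heads are ignored.

```python
def encoder_block_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Parameter count of one standard transformer encoder block (assumed layout)."""
    # Query, key, value and output projections: four d_model x d_model
    # matrices, each with a bias vector.
    attention = 4 * (d_model * d_model + d_model)
    # Two-layer MLP: d_model -> mlp_ratio * d_model -> d_model, with biases.
    hidden = mlp_ratio * d_model
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    # Two LayerNorms, each with a scale and a shift vector.
    norms = 2 * (2 * d_model)
    return attention + mlp + norms


def encoder_params(depth: int, d_model: int, num_heads: int, mlp_ratio: int = 4) -> int:
    """Parameter count of a stack of `depth` encoder blocks."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    # num_heads only sets the per-head width d_model // num_heads;
    # under this assumption it does not change the number of parameters.
    return depth * encoder_block_params(d_model, mlp_ratio)


# Hypothetical configurations chosen for illustration only.
deep = encoder_params(depth=12, d_model=768, num_heads=12)
lean = encoder_params(depth=8, d_model=768, num_heads=16)
reduction = 1 - lean / deep

print(deep)                 # 85054464
print(lean)                 # 56702976
print(f"{reduction:.1%}")   # 33.3%
```

Under these assumptions, dropping four of twelve layers removes a third of the encoder parameters, which is the order of saving the findings describe; whether accuracy survives such a cut is the paper's empirical claim and is not something this count can show.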
Abstract
Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means better", leading to ever-increasing model sizes. This paper challenges this ideology by showing that many existing transformers might be unnecessarily oversized. We discover a theoretical principle that redefines the role of multi-head attention. An important benefit of the multiple heads is in improving the conditioning of the attention block. We exploit this theoretical insight and redesign popular architectures with an increased number of heads. The improvement in the conditioning proves so significant in practice that model depth can be decreased, reducing the parameter count by up to 30-50% while maintaining accuracy. We obtain consistent benefits…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The theoretical idea is fresh and intuitive. It offers a new perspective on why multi-head setups are easier to train than single-head variants. - The empirical validation is broad. I like that the authors show results on both vision and language tasks. The ImageNet and GLUE results are convincing. The savings in memory and parameters are impressive. - Many transformer-related papers are technically heavy and offer unclear design guidance. But this paper contributes a simple rule of thum
- The theoretical section assumes independence and isotropy among heads, which probably doesn't hold in trained models. It would help to at least show empirical correlation plots or conditioning statistics from real networks to back this up. - The experiments are all in the mid-scale regime (ViT-B/L, BERT-base size). I would have liked to see preliminary results on a > 1B-parameter model to check whether the rule survives modern LLM training dynamics. - I'm not entirely convinced the performan
1. The direction of reducing the depth of vision transformers is interesting and indeed worth exploring. The depth of deep networks plays a more essential role than width in terms of practical efficiency, and it is appreciated that the authors have provided theoretical insights to validate the proposed scheme. 2. The writing is clear and the paper is easy to follow. The overall structure is well organized and the idea is presented in a coherent manner. 3. It is appreciated that the
1. The technical contribution of the paper seems limited. The core idea, i.e., that more heads and fewer layers lead to a better trade-off between efficiency and accuracy, looks like an empirical finding that can hardly be regarded as a systematic methodology. Although the authors have provided theoretical insights, they do not provide a systematic guideline for how to adjust the trade-off between heads and layers. Is the trade-off chosen from empirical results per architecture? 2. While the rationale h
1. The mathematical and empirical analysis is clearly presented 2. The mathematical analysis generates interesting insights which then strongly motivate the architectural changes explored 3. The empirical analysis is extensive, covering several different domains and task types 4. The empirical analysis is detailed, aiding reproducibility, and conducts lots of different experiments to validate the hypotheses of the paper 5. The empirical results are strong, and give good evidence to the hypothese
1. The authors may wish to cite further prior empirical work showing that one can make this tradeoff between attention head count and layer size while still preserving accuracy, such as https://arxiv.org/abs/2210.00640, or https://proceedings.neurips.cc/paper_files/paper/2023/hash/3504a4fa45685d668ce92797fbbf1895-Abstract-Conference.html 2. Given the dominance of scaling GPT-style transformers and language-modelling in today's AI landscape, it would be interesting to see how the improvements fro
- Provides a mathematically grounded reinterpretation of multi-head attention’s role in optimization. - Consistent empirical validation across domains (vision, NLP, long-sequence). - Strong practical message toward leaner Transformer design. - The combination of theoretical insight and empirical validation across different domains makes the findings broadly relevant and practically actionable.
- Core assumptions (independent Gaussian heads, isotropy, bounded singular values) are unrealistic during actual training; their validity beyond initialization is unclear. - The theoretical link between Jacobian conditioning and global optimization/generalization remains unproven. - Experiments are limited to mid-scale models (<200M parameters); it remains unclear whether the same depth-head trade-off persists under large-scale pretraining or diverse data distributions.
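The conditioning statistics the reviewers ask for could be collected along the following lines. This is a minimal sketch, assuming NumPy and assuming that a head corresponds to a contiguous block of columns of a projection matrix; the helper names are hypothetical and the weight is a random Gaussian stand-in, not a trained network. It measures the condition number (largest over smallest singular value) of each per-head slice and is not the paper's Jacobian analysis.

```python
import numpy as np


def condition_number(matrix: np.ndarray) -> float:
    """Ratio of the largest to the smallest singular value."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)  # sorted descending
    return float(singular_values[0] / singular_values[-1])


def per_head_condition_numbers(weight: np.ndarray, num_heads: int) -> list[float]:
    """Condition number of each head's column slice of a projection matrix."""
    d_out = weight.shape[1]
    if d_out % num_heads != 0:
        raise ValueError("output width must be divisible by num_heads")
    # Assumption: head i owns columns [i * d_head, (i + 1) * d_head).
    head_slices = np.split(weight, num_heads, axis=1)
    return [condition_number(head) for head in head_slices]


# Random Gaussian stand-in for a query projection (illustrative only).
rng = np.random.default_rng(0)
d_model = 64
w_query = rng.normal(0.0, d_model ** -0.5, size=(d_model, d_model))

for num_heads in (1, 4, 16):
    worst = max(per_head_condition_numbers(w_query, num_heads))
    print(f"heads={num_heads:2d}  worst per-head condition number={worst:.2f}")
```

For a fixed matrix, a column slice can never be worse conditioned than the whole, so the worst per-head value here cannot exceed the single-head value; that is a plain linear-algebra fact about slicing, not evidence for the paper's claim. To address the reviewers' concern, `w_query` would be replaced with weights taken from trained checkpoints at several points during training.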
### Strengths

The paper has multiple strengths:

* **Clarity in Presentation:** The main results are presented clearly and effectively, with key findings highlighted in well-designed boxes that make them easy to follow and visually accessible.
* **Jacobian Analysis:** The analysis of the Jacobian matrix of the attention mechanism is a valuable and insightful direction. It provides a deeper understanding of the conditioning of the attention matrix and aligns well with the objectives outlined in
While the experimental setup in the paper is extensive, there are several notable shortcomings:

**1) Simplistic Preliminary Section**

The preliminary section is overly simplified. For instance, *Equation (2)* omits the layer normalization steps between layers. Additionally, in the paragraph following *Equation (3)*, the similarity metric for attention is denoted as $\phi$, which should ideally be expressed as $\phi(q, k)$. However, the paper describes it as *softmax($\phi(q, k)$)*, a formulatio
Taxonomy
Topics: Manufacturing Process and Optimization
