Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse

Josh Alman; Zhao Song

arXiv:2505.16284·cs.LG·May 23, 2025

Open Access · 3 Reviews

TL;DR

This paper demonstrates that large weights are essential in attention mechanisms of large language models to prevent layer collapse, which limits expressiveness, and that quadratic time complexity in attention computations is unavoidable for expressive transformers.

Contribution

It establishes that large weights, rather than skip connections, are necessary to avoid layer collapse and maintain expressiveness in attention-based models.

Findings

01. Large weights prevent layer collapse in attention models.

02. Skip connections do not prevent layer collapse with small weights.

03. Quadratic time complexity in attention is unavoidable for expressive transformers.

Abstract

Attention mechanisms lie at the heart of modern large language models (LLMs). Straightforward algorithms for forward and backward (gradient) computation take quadratic time, and a line of work initiated by [Alman and Song NeurIPS 2023] and [Alman and Song NeurIPS 2024] has shown that quadratic time is necessary unless the model weights are small, in which case almost linear time algorithms are possible. In this paper, we show that large weights are necessary to avoid a strong preclusion to representational strength we call layer collapse, which means that the entire network can be approximated well by a network with only a single layer. Thus, the quadratic running time of attention is unavoidable for expressive transformers. The notion of layer collapse that we introduce is a variant on the notion of rank collapse from the work of [Dong, Cordonnier, and Loukas ICML 2021]. They showed…
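The "straightforward" quadratic-time forward computation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the single-head setup, and the 1/sqrt(d) scaling are assumptions chosen for illustration. The point is only that the n x n score matrix is materialized explicitly, which is where the quadratic cost in the sequence length n comes from.

```python
import math

def softmax(row):
    # Subtract the row maximum before exponentiating, for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def naive_attention(Q, K, V):
    """Single-head softmax attention over n tokens (illustrative sketch).

    Q and K are n x d lists of lists; V is n x d_v.
    ASSUMPTION: scores are scaled by 1/sqrt(d), as in the standard transformer.
    """
    n, d = len(Q), len(Q[0])
    # Building all n * n pairwise scores is the quadratic step.
    scores = [
        [sum(Q[i][t] * K[j][t] for t in range(d)) / math.sqrt(d) for j in range(n)]
        for i in range(n)
    ]
    weights = [softmax(row) for row in scores]
    d_v = len(V[0])
    out = [
        [sum(weights[i][j] * V[j][c] for j in range(n)) for c in range(d_v)]
        for i in range(n)
    ]
    return out, weights
```

As a rough intuition for the small-weight regime: if every query is the zero vector, all scores are equal, each row of the attention matrix becomes uniform, and every output row is just the average of the rows of V.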

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 0 · Confidence 5

Strengths

The idea of this paper is interesting and could, if shown, affect the way models are pre-trained since it posits that using only small weights can lead to layer collapse.

Weaknesses

Minor: The related-work section is confusing. The first three paragraphs mostly cover related work (although they sometimes include topics that do not relate to the paper, such as privacy), but the next three paragraphs discuss works that are not related to the paper in any way (such as diffusion or regression models, when the paper is about the importance of large weights for avoiding layer collapse). This adds half a page of irrelevant text to the paper.

Minor/Major: Lemma 4.1 and Lemma 4.

Reviewer 02 · Rating 4 · Confidence 2

Strengths

- Provides a clear and rigorous theoretical result on the limits of skip connections under small-weight conditions.
- Elegant and well-structured proofs using softmax perturbation and layer-removal arguments.
- Corrects a major misconception in prior work (Dong et al., 2021).

Weaknesses

- No empirical validation or quantitative mapping of η to realistic settings.
- Purely theoretical; lacks visualizations or verification experiments like those in Dong et al. (2021). While Dong et al. provided practical insight ("skip connections are necessary"), this paper mainly corrects a misconception without offering clear actionable guidance for model design or training.
- The paper also contains unexplained square symbols in several places.

Reviewer 03 · Rating 0 · Confidence 4

Strengths

The main motivation of this paper is strong and -- if the main Theorem was proven properly -- this paper would make a great contribution to the field. I believe that studying the "layer collapse" instead of the "rank collapse" (as defined by the authors) makes a lot of sense and could lead to interesting discoveries in the future.

Weaknesses

The paper is poorly structured and full of typos and logical mistakes. The very first lemma, Lemma 4.1, is incorrect, and its proof has a mistake in the first inequality. The lemma states that if $\|A-B\|_{\infty} \leq \epsilon$, then $\|Res(A) - Res(B)\|_{\infty} \leq \epsilon$. A counterexample is $A = \begin{pmatrix} 2 & 0 & -2 \end{pmatrix}^T$ and $B = \begin{pmatrix} 1 & 1 & -3 \end{pmatrix}^T$; then $Res(A) = A - 0 = A$ and $Res(B) = B - (-1) = \begin{pmatrix} 2 & 2 & -
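The reviewer's counterexample can be checked numerically with a small sketch. Two things here are assumptions, since the review excerpt does not define them: that Res subtracts from a column vector the scalar minimizing the largest absolute residual (the midrange, which reproduces the shifts 0 and -1 quoted above), and that the norm is the largest absolute entry. The helper names `res` and `inf_dist` are hypothetical.

```python
def res(col):
    # ASSUMPTION: Res subtracts the scalar c that minimizes max_i |col[i] - c|,
    # which is the midrange (max + min) / 2. This matches the shifts 0 and -1
    # that the review reports for A and B.
    c = (max(col) + min(col)) / 2
    return [v - c for v in col]

def inf_dist(u, v):
    # ASSUMPTION: the norm in the lemma is the largest absolute entry.
    return max(abs(a - b) for a, b in zip(u, v))

A = [2, 0, -2]
B = [1, 1, -3]

before = inf_dist(A, B)            # distance between A and B
after = inf_dist(res(A), res(B))   # distance between Res(A) and Res(B)
print(before, after)
```

Under these assumptions, `after` exceeds `before`, so the bound claimed in the lemma as quoted by the reviewer fails for any epsilon between the two values.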

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks

Methods: Softmax · Attention Is All You Need