Is Random Attention Sufficient for Sequence Modeling? Disentangling Trainable Components in the Transformer

Yihe Dong; Lorenzo Noci; Mikhail Khodak; Mufan Li

arXiv:2506.01115 · cs.LG · September 5, 2025


TL;DR

This paper investigates how much of a transformer's performance depends on learned attention. Models with random or frozen attention components still perform competitively, which highlights the architecture's inherent inductive biases and the distinct contributions of its parts.

Contribution

The study shows that transformers with frozen or random attention can still perform competitively, proves a new expressivity result for transformers with frozen key and query weights, and introduces a novel architecture, MixiT.

Findings

01. Frozen attention can form induction heads and perform language modeling.

02. Random attention with stable signal propagation is effective in deep transformers.

03. Attention is key for in-context reasoning, while MLPs aid knowledge storage.

Abstract

The transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of tasks - including mathematical reasoning, memorization, and retrieval - using only gradient-based learning on next-token prediction. While the core component of a transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard transformers to variants in which either the MLP layers or the attention weights are frozen at initialization. Surprisingly, we find that attention with frozen key and query weights is not only able to form induction heads, but can also perform competitively on language modeling. We formalize this by proving a new expressivity result for transformer models with frozen key and query weights. To further…
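To make the phrase "frozen key and query weights" concrete, here is a minimal sketch of a single causal attention head whose query and key matrices stay at their random initialization while only the value matrix receives updates. This is an illustration under assumptions, not the authors' implementation: the single-head layout, the Gaussian initialization scale, the plain SGD update, and all names (`init_params`, `attention`, `sgd_step`, `FROZEN`) are hypothetical choices made for this example.

```python
import math
import random

# Names of parameters kept at initialization (assumption: only W_V is trained here).
FROZEN = {"W_Q", "W_K"}


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng, scale):
    """Gaussian random matrix; the scale is an illustrative choice."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(d_model, seed=0):
    """Draw W_Q, W_K, W_V once; W_Q and W_K are never updated afterwards."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_model)
    return {
        name: random_matrix(d_model, d_model, rng, scale)
        for name in ("W_Q", "W_K", "W_V")
    }


def causal_softmax(scores):
    """Row-wise softmax where position i only attends to positions <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


def attention(x, params):
    """Single-head causal self-attention over a sequence x of shape (T, d_model)."""
    q = matmul(x, params["W_Q"])
    k = matmul(x, params["W_K"])
    v = matmul(x, params["W_V"])
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = causal_softmax(scores)
    return matmul(weights, v), weights


def sgd_step(params, grads, lr=0.1):
    """Apply a gradient step to trainable parameters only; frozen ones pass through."""
    updated = {}
    for name, w in params.items():
        if name in FROZEN:
            updated[name] = w  # frozen at initialization
        else:
            updated[name] = [
                [wij - lr * gij for wij, gij in zip(w_row, g_row)]
                for w_row, g_row in zip(w, grads[name])
            ]
    return updated
```

Calling `attention(x, init_params(d_model))` on a list of token vectors returns the head output together with the attention weights. Note that the weights still depend on the input even though `W_Q` and `W_K` are random and fixed. Passing gradients to `sgd_step` leaves `W_Q` and `W_K` untouched, which is the sense of "frozen" used in this sketch.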
