Is Random Attention Sufficient for Sequence Modeling? Disentangling Trainable Components in the Transformer
Yihe Dong, Lorenzo Noci, Mikhail Khodak, Mufan Li

TL;DR
This paper asks how much of a transformer's performance comes from learned attention. Models whose attention components are frozen at initialization or replaced with random attention remain competitive, which points to inductive biases in the architecture itself and to distinct contributions from its parts.
Contribution
The study demonstrates that transformers with frozen or random attention can still perform well, supports this with new expressivity results, and introduces a novel architecture, MixiT.
Findings
Frozen attention can form induction heads and perform language modeling.
Random attention with stable signal propagation is effective in deep transformers.
Attention is key for in-context reasoning, while MLPs aid knowledge storage.
Abstract
The transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of tasks - including mathematical reasoning, memorization, and retrieval - using only gradient-based learning on next-token prediction. While the core component of a transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard transformers to variants in which either the MLP layers or the attention weights are frozen at initialization. Surprisingly, we find that attention with frozen key and query weights is not only able to form induction heads, but can also perform competitively on language modeling. We formalize this by proving a new expressivity result for transformer models with frozen key and query weights. To further…
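To make the "frozen key and query weights" setup concrete, here is a minimal pure-Python sketch of a single causal attention head, not the authors' implementation. The names (`params`, `sgd_step`, `causal_attention_weights`), the single-head layout, the Gaussian initialization scale, and the choice to leave only the value projection trainable are all illustrative assumptions. The key and query matrices are drawn once and never updated, so the attention pattern still depends on the input but not on any learned parameter.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng, scale):
    """Gaussian random matrix; the scale is an assumed initialization choice."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def causal_attention_weights(x, w_q, w_k):
    """Softmax attention weights where position i attends to positions 0..i."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    d = len(q[0])
    weights = []
    for i, q_i in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(x) - i - 1))
    return weights


def attention(x, params):
    """One attention head: frozen W_q and W_k, trainable W_v (assumed split)."""
    a = causal_attention_weights(x, params["frozen"]["w_q"], params["frozen"]["w_k"])
    v = matmul(x, params["trainable"]["w_v"])
    return matmul(a, v)


def sgd_step(params, grads, lr=0.1):
    """Update trainable matrices only; frozen matrices are never touched."""
    for name, g in grads.items():
        w = params["trainable"][name]  # raises KeyError for a frozen name
        params["trainable"][name] = [
            [w_ij - lr * g_ij for w_ij, g_ij in zip(w_row, g_row)]
            for w_row, g_row in zip(w, g)
        ]


rng = random.Random(0)
d_model, seq_len = 4, 3
params = {
    "frozen": {
        "w_q": random_matrix(d_model, d_model, rng, d_model ** -0.5),
        "w_k": random_matrix(d_model, d_model, rng, d_model ** -0.5),
    },
    "trainable": {
        "w_v": random_matrix(d_model, d_model, rng, d_model ** -0.5),
    },
}
x = random_matrix(seq_len, d_model, rng, 1.0)  # toy token embeddings
out = attention(x, params)  # seq_len rows, each of length d_model
```

Keeping the frozen and trainable matrices in separate dictionaries makes the split explicit: an optimizer step that only iterates over the trainable group cannot modify the key or query weights by accident.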
