Attention as a Hypernetwork
Simon Schug, Seijin Kobayashi, Yassir Akram, João Sacramento, Razvan Pascanu

TL;DR
This paper reformulates multi-head attention as a hypernetwork, revealing a low-dimensional latent code that supports compositional generalization in transformers, especially on abstract reasoning tasks like Raven's Progressive Matrices.
Contribution
It introduces a hypernetwork perspective on multi-head attention, shows that the resulting latent codes are reused to generalize to unseen task compositions, and improves this ability by making the hypernetwork-generated value network nonlinear.
Findings
Latent codes predict subtasks in unseen compositions.
Scaling models and data enhances compositional generalization.
Nonlinear value networks improve abstract reasoning performance.
Abstract
Transformers can under some circumstances generalize to novel problem instances whose constituent parts might have been encountered during training, but whose compositions have not. What mechanisms underlie this ability for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a composable, low-dimensional latent code specifies key-query specific operations. We find empirically that this latent code is predictive of the subtasks the network performs on unseen task compositions, revealing that latent codes acquired during training are reused to solve unseen problem instances. To further examine the hypothesis that the intrinsic hypernetwork of multi-head attention supports compositional generalization, we ablate whether making the hypernetwork-generated linear value network nonlinear strengthens compositionality. We find that this…
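The reformulation described in the abstract can be checked numerically. The following is a minimal sketch, not the authors' code, and it assumes self-attention without softmax (multi-head linear attention) so that the identity is exact. The function names, tensor shapes, and weight names (`Wq`, `Wk`, `Wv`, `Wo`) are hypothetical choices made for illustration. The per-head attention scores for a query-key pair act as a latent code of dimension H (the number of heads), and that code linearly combines per-head matrices into a value network specific to that pair.

```python
import numpy as np

def latent_codes(x, Wq, Wk):
    """Per-head attention scores for every (query, key) pair, shape (T, T, H).
    x: (T, D); Wq, Wk: (H, D, Dh). No softmax is applied (linear attention)."""
    q = np.einsum("td,hde->hte", x, Wq)
    k = np.einsum("td,hde->hte", x, Wk)
    return np.einsum("hie,hje->ijh", q, k)

def multihead_linear_attention(x, Wq, Wk, Wv, Wo):
    """Usual head-by-head computation. Wv: (H, D, Dh); Wo: (H, Dh, D)."""
    a = latent_codes(x, Wq, Wk)                  # (T, T, H)
    v = np.einsum("td,hde->hte", x, Wv)          # (H, T, Dh)
    heads = np.einsum("ijh,hje->hie", a, v)      # weighted sum over keys j
    return np.einsum("hie,hef->if", heads, Wo)   # project and sum over heads

def hypernetwork_view(x, Wq, Wk, Wv, Wo):
    """Same computation, read as a hypernetwork generating a linear value network."""
    a = latent_codes(x, Wq, Wk)                  # latent code per (query i, key j)
    per_head = np.einsum("hde,hef->hdf", Wv, Wo) # one (D, D) map per head
    # The hypernetwork is linear: the code mixes the per-head maps
    # into a key-query specific weight matrix W[i, j] of shape (D, D).
    W = np.einsum("ijh,hdf->ijdf", a, per_head)
    # Apply each generated network to its key token and sum over keys.
    return np.einsum("jd,ijdf->if", x, W)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, H, Dh = 5, 8, 4, 2
    x = rng.normal(size=(T, D))
    Wq, Wk, Wv = (rng.normal(size=(H, D, Dh)) for _ in range(3))
    Wo = rng.normal(size=(H, Dh, D))
    same = np.allclose(multihead_linear_attention(x, Wq, Wk, Wv, Wo),
                       hypernetwork_view(x, Wq, Wk, Wv, Wo))
    print("outputs agree:", same)
```

The second function materializes a (T, T, D, D) tensor, so it is only practical for small inputs; it is meant to expose the structure, not to replace the standard computation.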
Peer Reviews
Decision: ICLR 2025 Oral
The paper is well-written and studies an important problem, namely compositional generalisation, from a new angle and with promising empirical results. The writing, structure and related work are outlined clearly, and (almost) all relevant details with regard to the theory and the empirical experiments are available in either the main text or the appendix. The perspective of seeing multi-head attention as a hypernetwork is novel and might open interesting future avenues for study.
One problem in the empirical evaluation is a certain lack of control over whether the suggested modification (HYLA) is actually increasing compositional generalization or whether it's just a way of increasing the capacity of the model. To this end, it would be interesting to look at the training loss of the models: does HYLA reach the same minimum as linear / softmax attention, or is it already performing much better on the training set (suggesting it is more the capacity increase of the model r
- The paper presents a novel perspective of multi-head attention as a hypernet. More specifically, the perspective that queries and keys dynamically create a function to apply to the values is a straightforward interpretation of the attention mechanism, but the proposed perspective further reveals where the learned compositional structure is potentially exhibited.
- The core technical contributions of the "multi-head self-attention as a hypernetwork" perspective and HYLA are presented very clear
The paper seems to consider the special case of self-attention:
- The proposed perspective of attention as hypernet is presented with the assumption of self-attention, which is a special case of attention (e.g., cross-attention is more general). While the assumption of self-attention can be seen in figures and equations, it is only explicitly mentioned in the text at line 114. Without clarifying that this perspective and the proposed modification (HYLA) consider the special case of self-attentio
- The idea of viewing multi-head attention as a hypernetwork is interesting.
- The experiments on a controlled domain such as symbolic RAVEN test are clear and thorough.
- The research problem is of great interest to the ML community.
- The paper is well-written and well-organized.
- I may be missing something, but the modification of the linear attention layer seems somewhat restrictive. Will other nonlinearities or $\sigma(\cdot)$ functions other than `RMSHead` also lead to improved compositional generalization ability? It would be great if the authors could comment on this and explain why this modification works.
- I am admittedly not an expert in this domain, and would like to hear my fellow reviewers' thoughts on this.
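The question about alternatives to `RMSHead` can be made concrete with a sketch in which the normalization is a swappable argument. This is an illustrative reading, not the authors' implementation: it assumes self-attention, assumes that `RMSHead` means dividing each latent code by its root-mean-square across heads, and assumes a ReLU between the generated value projection and the generated output projection. All function and parameter names here are hypothetical.

```python
import numpy as np

def rms_over_heads(a, eps=1e-6):
    """Assumed form of RMSHead: scale each latent code (last axis = heads)
    by its root-mean-square across heads."""
    return a / np.sqrt(np.mean(a ** 2, axis=-1, keepdims=True) + eps)

def nonlinear_value_attention(x, Wq, Wk, Wv, Wo, normalize=rms_over_heads):
    """Attention with a hypernetwork-generated one-hidden-layer value network.
    x: (T, D); Wq, Wk, Wv: (H, D, Dh); Wo: (H, Dh, D). Returns (T, D)."""
    q = np.einsum("td,hde->hte", x, Wq)
    k = np.einsum("td,hde->hte", x, Wk)
    a = normalize(np.einsum("hie,hje->ijh", q, k))   # latent codes (T, T, H)
    # The code for each (query i, key j) pair generates both layers.
    W_in = np.einsum("ijh,hde->ijde", a, Wv)         # (T, T, D, Dh)
    W_out = np.einsum("ijh,hef->ijef", a, Wo)        # (T, T, Dh, D)
    hidden = np.maximum(np.einsum("jd,ijde->ije", x, W_in), 0.0)  # ReLU
    return np.einsum("ije,ijef->if", hidden, W_out)  # sum over keys j

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, H, Dh = 5, 8, 4, 2
    x = rng.normal(size=(T, D))
    Wq, Wk, Wv = (rng.normal(size=(H, D, Dh)) for _ in range(3))
    Wo = rng.normal(size=(H, Dh, D))
    y_rms = nonlinear_value_attention(x, Wq, Wk, Wv, Wo)
    # Swap in a different function of the latent code, e.g. the identity.
    y_raw = nonlinear_value_attention(x, Wq, Wk, Wv, Wo, normalize=lambda a: a)
    print(y_rms.shape, y_raw.shape)
```

Passing a different `normalize` callable (for example the identity, or a softmax over keys) is one way to run the comparison the reviewer asks about; the sketch makes no claim about which choice generalizes better.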
Taxonomy
Topics: Neuroscience, Education and Cognitive Function · Creativity in Education and Neuroscience
Methods: Attention Is All You Need · Softmax · HyperNetwork · Linear Layer · Multi-Head Attention · Multi-Head Linear Attention
