Beyond Parallelism: Synergistic Computational Graph Effects in Multi-Head Attention
Haitz Sáez de Ocáriz Borde

TL;DR
This paper explores the theoretical and empirical benefits of multi-head attention in Transformers, revealing how multiple heads can synergistically improve information flow beyond simple parallelism.
Contribution
It introduces a novel framework viewing multi-head attention as synergistic computational graphs and provides theoretical analysis and empirical validation of its advantages.
Findings
Multi-head attention can enhance information propagation under specific head-diversity conditions.
Synergistic effects between heads yield faster mixing times and minimax fidelity amplification.
Empirical results confirm theoretical predictions.
Abstract
Multi-head attention powers Transformer networks, the primary deep learning architecture behind the success of large language models (LLMs). Yet, the theoretical advantages of multi-head versus single-head attention, beyond mere parallel processing, remain underexplored. In this paper, we reframe multi-head attention as a system of potentially synergistic computational graphs, where each head functions as a feedforward directed acyclic graph (DAG) with a common sink state. We provide intuition and preliminary theoretical analysis of mixing time and minimax fidelity in this framework. Our results show that multi-head attention can synergistically enhance information propagation, yielding faster mixing times and minimax fidelity amplification under specific head-diversity conditions. Finally, we train single-head and multi-head Transformers, each with the same total number of parameters,…
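A comparison at equal parameter count is straightforward in the standard Transformer formulation, where splitting the model dimension across heads leaves the number of attention projection weights unchanged. The sketch below is only an illustration under that assumption (per-head width d_model / num_heads, no bias terms); the helper name `attention_param_count` is hypothetical, and the excerpt above does not state which configuration the paper actually trains.

```python
def attention_param_count(d_model: int, num_heads: int) -> int:
    """Count projection weights in one standard attention layer (no biases).

    Assumes the common convention d_head = d_model / num_heads; this is an
    illustrative assumption, not a detail taken from the paper.
    """
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    # Each head has query, key, and value projections of shape (d_model, d_head).
    per_head_qkv = 3 * d_model * d_head

    # The output projection maps the concatenated heads back to d_model.
    output_proj = (num_heads * d_head) * d_model

    return num_heads * per_head_qkv + output_proj


single_head = attention_param_count(d_model=512, num_heads=1)
multi_head = attention_param_count(d_model=512, num_heads=8)
print(single_head, multi_head)
```

Under this convention both counts equal 4 * d_model^2, so any difference between the single-head and multi-head models comes from how the computation is structured, not from capacity.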
Taxonomy
Topics: Cognitive Functions and Memory
