Beyond Parallelism: Synergistic Computational Graph Effects in Multi-Head Attention

Haitz Sáez de Ocáriz Borde

arXiv:2507.02944·cs.LG·November 11, 2025

Open Access

TL;DR

This paper explores the theoretical and empirical benefits of multi-head attention in Transformers, revealing how multiple heads can synergistically improve information flow beyond simple parallelism.

Contribution

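The paper reframes multi-head attention as a system of potentially synergistic computational graphs, in which each head is a feedforward directed acyclic graph (DAG) with a common sink state. It offers a preliminary theoretical analysis of mixing time and minimax fidelity in this framework, and compares single-head and multi-head Transformers trained with the same total number of parameters.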

Findings

01

Multi-head attention can enhance information propagation under specific head-diversity conditions.

02

Synergistic effects yield faster mixing times and minimax fidelity amplification.

03

Empirical results confirm theoretical predictions.

Abstract

Multi-head attention powers Transformer networks, the primary deep learning architecture behind the success of large language models (LLMs). Yet, the theoretical advantages of multi-head versus single-head attention, beyond mere parallel processing, remain underexplored. In this paper, we reframe multi-head attention as a system of potentially synergistic computational graphs, where each head functions as a feedforward directed acyclic graph (DAG) with a common sink state. We provide intuition and preliminary theoretical analysis of mixing time and minimax fidelity in this framework. Our results show that multi-head attention can synergistically enhance information propagation, yielding faster mixing times and minimax fidelity amplification under specific head-diversity conditions. Finally, we train single-head and multi-head Transformers, each with the same total number of parameters,…
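
The graph view in the abstract can be pictured with a small numerical sketch. This is an illustration under stated assumptions, not the paper's construction: it assumes causal (lower-triangular) attention, treats each head's attention matrix as a row-stochastic transition matrix in which the first token acts as the shared sink, and combines heads by a uniform average. The function names `causal_attention` and `steps_to_sink`, the random scores, and the tolerance are all hypothetical choices made for the example.

```python
import numpy as np


def causal_attention(scores):
    """Turn raw scores into a row-stochastic, lower-triangular attention matrix.

    Assumption: causal masking, so token i only attends to tokens 0..i.
    The edges therefore form a DAG (plus self-loops) and token 0 is absorbing.
    """
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for a numerically stable softmax.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)


def steps_to_sink(P, tol=1e-3, max_steps=10_000):
    """Smallest t such that every row of P^t puts at least 1 - tol mass on state 0.

    A crude stand-in for a mixing time; returns None if not reached.
    """
    Pt = np.eye(P.shape[0])
    for t in range(1, max_steps + 1):
        Pt = Pt @ P
        if Pt[:, 0].min() >= 1.0 - tol:
            return t
    return None


rng = np.random.default_rng(0)
n, num_heads = 8, 4

# Hypothetical heads: independent random scores stand in for learned attention.
heads = [causal_attention(rng.normal(size=(n, n))) for _ in range(num_heads)]

# Assumption: the combined graph is a uniform mixture of the per-head graphs.
mixture = sum(heads) / num_heads

for i, P in enumerate(heads):
    print(f"head {i}: steps to sink = {steps_to_sink(P)}")
print(f"mixture: steps to sink = {steps_to_sink(mixture)}")
```

Comparing the per-head step counts with the mixture's gives a rough feel for how combining diverse graphs can change the rate at which mass reaches the sink. Whether the mixture is faster depends on the particular matrices, which is in line with the abstract's caveat that the gains hold under specific head-diversity conditions.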

Taxonomy

Topics: Cognitive Functions and Memory