The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms

Hikari Otsuka; Daiki Chijiwa; Yasuyuki Okoshi; Daichi Fujiki; Susumu Takeuchi; Masato Motomura

arXiv:2511.04217·cs.LG·November 7, 2025

TL;DR

This paper provides a theoretical foundation for the strong lottery ticket hypothesis (SLTH) in the multi-head attention (MHA) mechanisms of transformers. It shows that a randomly initialized MHA with sufficiently large key and value hidden dimensions contains, with high probability, a subnetwork that approximates an arbitrary MHA of the same input dimension.

Contribution

The paper introduces the first theoretical analysis of strong lottery tickets (SLTs) in MHAs and extends the SLTH to transformers without normalization layers.

Findings

01. SLTs exist in randomly initialized MHAs with high probability

02. Approximation error decreases exponentially with hidden dimension

03. Theory validated through empirical experiments

Abstract

The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that, if a randomly initialized MHA of $H$ heads and input dimension $d$ has the hidden dimension $O(d \log(H d^{3/2}))$ for the key and value, it contains an SLT that approximates an arbitrary MHA with the same input dimension with high probability. Furthermore, by leveraging this theory for…
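To get a feel for how the stated width bound grows, the term inside $O(d \log(H d^{3/2}))$ can be evaluated directly. The following is a minimal sketch, not the authors' code: the function name `slt_hidden_dim_scale` is hypothetical, the natural logarithm is an assumption (the base only changes a constant factor), and the constant hidden by the big-O is not given in the abstract, so the values are scaling terms rather than required hidden dimensions.

```python
import math


def slt_hidden_dim_scale(num_heads: int, input_dim: int) -> float:
    """Return d * log(H * d^(3/2)), the growth term inside the O(.) bound.

    Assumption: natural log, and the unknown big-O constant is omitted,
    so this is a relative scale, not an actual layer width.
    """
    if num_heads < 1 or input_dim < 1:
        raise ValueError("num_heads and input_dim must be positive")
    return input_dim * math.log(num_heads * input_dim ** 1.5)


if __name__ == "__main__":
    # Illustrative (H, d) pairs chosen for this sketch, not taken from the paper.
    for heads, dim in [(1, 64), (8, 64), (8, 512)]:
        scale = slt_hidden_dim_scale(heads, dim)
        print(f"H={heads:2d}  d={dim:4d}  d*log(H*d^1.5) = {scale:.1f}")
```

Under this reading, the term grows only logarithmically in the number of heads $H$ and slightly faster than linearly in the input dimension $d$.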

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)