The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms
Hikari Otsuka, Daiki Chijiwa, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi, Masato Motomura

TL;DR
This paper provides a theoretical foundation for the strong lottery ticket hypothesis in the multi-head attention (MHA) mechanisms of transformers, showing that, with high probability, a randomly initialized MHA contains a subnetwork that approximates an arbitrary target MHA.
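To make "subnetwork of a randomly initialized MHA" concrete, the following is a minimal sketch of pruning by binary masks. It is a simplification, not the paper's construction: it uses a single head with no output projection, and the names `masked`, `attention_head`, `random_matrix`, and `random_mask` are hypothetical helpers chosen for illustration. The masks here are drawn at random only to show the mechanics. Finding a strong lottery ticket means choosing the masks so that the pruned head matches a target.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked(w, mask):
    """Keep w[i][j] where mask[i][j] == 1; prune it (set it to 0) otherwise."""
    return [[wij * mij for wij, mij in zip(wr, mr)] for wr, mr in zip(w, mask)]


def attention_head(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention for x of shape (tokens, d)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(w_k[0]))  # square root of the key dimension
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    return matmul([softmax(r) for r in scores], v)


def random_matrix(rows, cols, rng):
    """Random weights; uniform on [-1, 1] is an assumption of this sketch."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def random_mask(rows, cols, rng, keep=0.5):
    """Binary mask that keeps each weight independently with probability `keep`."""
    return [[1 if rng.random() < keep else 0 for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    d, hidden, tokens = 4, 8, 3  # illustrative sizes only
    x = random_matrix(tokens, d, rng)
    weights = [random_matrix(d, hidden, rng) for _ in range(3)]  # W_Q, W_K, W_V
    masks = [random_mask(d, hidden, rng) for _ in range(3)]
    pruned = [masked(w, m) for w, m in zip(weights, masks)]
    out = attention_head(x, *pruned)
    print(len(out), len(out[0]))  # output shape: tokens x hidden
```

The weights are never trained in this setting. Only the masks change, so the search space is the set of subnetworks of the random head.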
Contribution
The paper gives the first theoretical analysis of SLTs in MHAs and extends the SLTH to transformers without normalization layers.
Findings
SLTs exist in randomly initialized MHAs with high probability
Approximation error decreases exponentially with hidden dimension
Theory validated through empirical experiments
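The exponential trend in the findings above can be illustrated with a toy experiment. Existence proofs for strong lottery tickets commonly rest on a random subset-sum argument: given enough random weights, some subset sums to nearly any target value. The sketch below is that one-dimensional toy, not the paper's MHA construction or its experiments. The uniform distributions, the sizes, and the function names `best_subset_error` and `mean_error` are assumptions made for illustration.

```python
import itertools
import random


def best_subset_error(weights, target):
    """Return the smallest |target - sum(subset)| over all subsets of weights."""
    best = abs(target)  # the empty subset (everything pruned) sums to 0
    for r in range(1, len(weights) + 1):
        for subset in itertools.combinations(weights, r):
            best = min(best, abs(target - sum(subset)))
    return best


def mean_error(n, trials=20, seed=0):
    """Average best-subset error over random draws of n weights and a target."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        weights = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        target = rng.uniform(-0.5, 0.5)
        total += best_subset_error(weights, target)
    return total / trials


if __name__ == "__main__":
    # More random candidates ("width") means exponentially many more subsets to pick from.
    for n in (2, 4, 8, 12):
        print(n, mean_error(n))
```

The search is brute force over all 2^n subsets, so it only suits small n. It is meant to show the trend, not to serve as a practical pruning method.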
Abstract
The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that, if a randomly initialized MHA with a given number of heads and input dimension has a sufficiently large hidden dimension for the key and value, it contains, with high probability, an SLT that approximates an arbitrary MHA with the same input dimension. Furthermore, by leveraging this theory for…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)
