On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation
Nghiem T. Diep, Huy Nguyen, Chau Nguyen, Minh Le, Duy M. H. Nguyen, Daniel Sonntag, Mathias Niepert, Nhat Ho

TL;DR
This paper provides a theoretical foundation for zero-initialized attention in LLMs by linking it to mixture-of-expert models, and validates empirically that the approach works with both linear and non-linear prompts.
Contribution
It offers a rigorous theoretical analysis of zero-initialized attention, connects it to mixture-of-expert models, and proves that both linear and non-linear prompts, along with the gating functions, can be optimally estimated.
Findings
Non-linear prompts outperform linear prompts in experiments.
Zero-initialized attention surpasses vanilla attention even with limited data.
A theoretical connection is established between zero-initialized attention and mixture-of-expert models.
Abstract
The LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and…
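To make the mechanism named in the abstract concrete, here is a minimal sketch of zero-initialized attention for a single query vector. It is an illustration under stated assumptions, not the authors' implementation: the function name `zero_init_attention`, the `tanh` squashing of the gate, and the choice to apply softmax separately to the prompt scores and the token scores are assumptions based on the commonly described LLaMA-Adapter formulation. Each softmax weight can be read loosely as a gating weight over the value vectors, which is one informal way to picture the mixture-of-expert connection the abstract mentions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def zero_init_attention(query, token_keys, token_values,
                        prompt_keys, prompt_values, gate=0.0):
    """Attention output for one query with a gated prompt branch.

    Assumption: the prompt branch is scaled by tanh(gate), and the gate
    starts at zero so the prompt contributes nothing at initialization.
    """
    scale = math.sqrt(len(query))

    # Ordinary attention weights over the original tokens.
    token_w = softmax([dot(query, k) / scale for k in token_keys])
    # Separate softmax over the learnable prompt tokens.
    prompt_w = softmax([dot(query, k) / scale for k in prompt_keys])

    g = math.tanh(gate)  # equals 0.0 when gate == 0.0
    dim = len(token_values[0])
    out = [0.0] * dim
    for w, v in zip(token_w, token_values):
        for i in range(dim):
            out[i] += w * v[i]
    for w, v in zip(prompt_w, prompt_values):
        for i in range(dim):
            out[i] += g * w * v[i]
    return out


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[1.0, 2.0], [3.0, 4.0]]
    p_keys = [[0.5, 0.5]]
    p_values = [[10.0, -10.0]]
    print("gate = 0.0:", zero_init_attention(q, keys, values, p_keys, p_values, 0.0))
    print("gate = 1.0:", zero_init_attention(q, keys, values, p_keys, p_values, 1.0))
```

With `gate=0.0` the output matches plain attention over the original tokens, whatever the prompt contains. That is the stabilizing property the abstract attributes to zero initialization: the prompt only begins to influence the output as training moves the gate away from zero.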
Taxonomy
Topics: EEG and Brain-Computer Interfaces
Methods: Softmax · Attention Is All You Need · LLaMA
