Role of Bias Terms in Dot-Product Attention
Mahdi Namazifar, Devamanyu Hazarika, Dilek Hakkani-Tur

TL;DR
This paper investigates the roles of the bias terms in dot-product attention within transformers. It shows mathematically that the key bias is redundant and argues that the value bias plays a more prominent role than the query bias; experiments on language modeling, natural language understanding, and natural language generation support both claims.
Contribution
The paper gives a mathematical analysis of the bias terms in the query, key, and value linear transformations of the attention module. It shows that the key bias can be dropped without changing the module's output and argues that the value bias matters more than the query bias.
Findings
Key bias term is redundant and can be omitted.
Value bias has a more prominent role than the query bias.
Empirical results confirm theoretical analysis across NLP tasks.
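The first two findings can be motivated with a short derivation. This is a sketch rather than the paper's own presentation: the symbols (x_i for input tokens, W_q, W_k, W_v and b_q, b_k, b_v for the three linear transformations, d for the head dimension, alpha_ij for attention weights) and the column-vector convention are assumptions chosen for illustration.

```latex
% Assumed notation: q_i = W_q x_i + b_q,  k_j = W_k x_j + b_k,  v_j = W_v x_j + b_v.
\begin{align*}
q_i^\top k_j &= q_i^\top W_k x_j + q_i^\top b_k
  && \text{(the second term does not depend on } j\text{)} \\
\alpha_{ij} &= \frac{\exp\!\big(q_i^\top k_j / \sqrt{d}\big)}{\sum_{l} \exp\!\big(q_i^\top k_l / \sqrt{d}\big)}
  = \frac{\exp\!\big(q_i^\top W_k x_j / \sqrt{d}\big)}{\sum_{l} \exp\!\big(q_i^\top W_k x_l / \sqrt{d}\big)}
  && \text{(softmax is invariant to a per-row shift)} \\
o_i &= \sum_j \alpha_{ij}\,\big(W_v x_j + b_v\big)
  = \sum_j \alpha_{ij}\, W_v x_j + b_v
  && \text{(since } \textstyle\sum_j \alpha_{ij} = 1\text{)}
\end{align*}
```

Under these assumptions, b_k cancels out of the attention weights entirely, while b_v is added unchanged to every output position.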
Abstract
Dot-product attention is a core module in the present generation of neural network models, particularly transformers, and is being leveraged across numerous areas such as natural language processing and computer vision. This attention module is comprised of three linear transformations, namely query, key, and value linear transformations, each of which has a bias term. In this work, we study the role of these bias terms, and mathematically show that the bias term of the key linear transformation is redundant and could be omitted without any impact on the attention module. Moreover, we argue that the bias term of the value linear transformation has a more prominent role than that of the bias term of the query linear transformation. We empirically verify these findings through multiple experiments on language modeling, natural language understanding, and natural language generation tasks.
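The redundancy of the key bias can be checked numerically. The snippet below is a minimal sketch, not the authors' code: it assumes single-head, unmasked self-attention with a row-vector convention, and every name in it (`attention`, `Wq`, `bk`, and so on) is hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, bq, Wk, bk, Wv, bv):
    # Single-head scaled dot-product self-attention (illustrative assumption).
    q = x @ Wq + bq
    k = x @ Wk + bk
    v = x @ Wv + bv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
n, d = 5, 8  # sequence length and model width, chosen arbitrarily
x = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
bq, bk, bv = (rng.normal(size=d) for _ in range(3))

full = attention(x, Wq, bq, Wk, bk, Wv, bv)

# Zeroing the key bias shifts each row of scores by a constant,
# which softmax ignores, so the output should be unchanged.
no_key_bias = attention(x, Wq, bq, Wk, np.zeros(d), Wv, bv)
key_bias_gap = np.abs(full - no_key_bias).max()

# Zeroing the value bias should change the output by exactly bv,
# because each row of attention weights sums to one.
no_value_bias = attention(x, Wq, bq, Wk, bk, Wv, np.zeros(d))
value_bias_gap = np.abs(full - (no_value_bias + bv)).max()

print(key_bias_gap, value_bias_gap)
```

Both printed gaps should be at floating-point noise level. The check covers only this toy setting; it says nothing about trained models, which is what the paper's experiments address.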
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Advanced Neural Network Applications
