Role of Bias Terms in Dot-Product Attention

Mahdi Namazifar; Devamanyu Hazarika; Dilek Hakkani-Tur

arXiv:2302.08626 · cs.NE · February 20, 2023

TL;DR

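This paper studies the bias terms of the query, key, and value linear transformations in dot-product attention. It shows mathematically that the key bias is redundant and can be dropped without changing the attention output, argues that the value bias plays a more prominent role than the query bias, and checks both points with experiments on language modeling, natural language understanding, and natural language generation.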

Contribution

A mathematical analysis of the bias terms in the attention module, showing that the key bias has no effect on the module's output, together with an argument that the value bias plays a more prominent role than the query bias.

Findings

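01 The key bias term is redundant and can be omitted without any impact on the attention module.

02 The value bias has a more prominent role than the query bias.

03 Experiments on language modeling, natural language understanding, and natural language generation tasks support the theoretical analysis.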

Abstract

Dot-product attention is a core module in the present generation of neural network models, particularly transformers, and is being leveraged across numerous areas such as natural language processing and computer vision. This attention module is comprised of three linear transformations, namely query, key, and value linear transformations, each of which has a bias term. In this work, we study the role of these bias terms, and mathematically show that the bias term of the key linear transformation is redundant and could be omitted without any impact on the attention module. Moreover, we argue that the bias term of the value linear transformation has a more prominent role than that of the bias term of the query linear transformation. We empirically verify these findings through multiple experiments on language modeling, natural language understanding, and natural language generation tasks.
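The redundancy of the key bias can be checked numerically. The sketch below is not the authors' code; it is a minimal single-head attention in NumPy, with hypothetical names (`attention`, `Wq`, `bk`, and so on) and random weights chosen only for illustration. The reasoning it relies on: adding a bias to every key shifts all scores in a given query's row by the same constant, and softmax is unchanged by such a shift. It also checks a second fact, that because each row of attention weights sums to 1, the value bias passes straight through to the output as an additive term.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv, bq, bk, bv):
    # Single-head scaled dot-product self-attention with explicit biases.
    Q = X @ Wq + bq
    K = X @ Wk + bk
    V = X @ Wv + bv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

# Illustrative sizes and random parameters (assumptions, not from the paper).
rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
bq, bk, bv = (rng.normal(size=d) for _ in range(3))
zero = np.zeros(d)

full = attention(X, Wq, Wk, Wv, bq, bk, bv)

# Dropping the key bias should leave the output unchanged up to rounding.
no_key_bias = attention(X, Wq, Wk, Wv, bq, zero, bv)
key_bias_gap = np.abs(full - no_key_bias).max()

# Dropping the value bias should shift every output row by exactly bv.
no_value_bias = attention(X, Wq, Wk, Wv, bq, bk, zero)
value_bias_gap = np.abs(full - (no_value_bias + bv)).max()

print("max change from removing key bias:", key_bias_gap)
print("max deviation from 'output = no-bias output + bv':", value_bias_gap)
```

Both printed gaps should be at the level of floating-point rounding error. The query bias is deliberately left in place in every call, since removing it changes the scores differently for each key and so generally does change the output.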


Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Advanced Neural Network Applications