Alternatives to the Scaled Dot Product for Attention in the Transformer Neural Network Architecture

James Bernhard

arXiv:2311.09406 · cs.LG · November 17, 2023 · 1 citation

Open Access

TL;DR

This paper explores alternative scaling methods for the attention mechanism in transformers, proposing new approaches that may better prevent vanishing gradients compared to the standard scaled dot product.

Contribution

It introduces alternative scaling techniques for attention in transformers, including dividing by the sum of key lengths, to improve gradient flow.

Findings

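01 On simulated keys and queries, alternative scalings appear in many situations to avoid the regions where softmax gradients vanish more effectively than the standard scaling.

02 Dividing the dot product by the sum of the key lengths shows promising results.

03 The standard scaling by the square root of the key dimension may not always be optimal.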

Abstract

The transformer neural network architecture uses a form of attention in which the dot product of query and key is divided by the square root of the key dimension before applying softmax. This scaling of the dot product is designed to avoid the absolute value of the dot products becoming so large that applying softmax leads to vanishing gradients. In this paper, we propose some alternative scalings, including dividing the dot product instead by the sum of the key lengths before applying softmax. We use simulated keys and queries to show that in many situations this appears to be more effective at avoiding regions where applying softmax leads to vanishing gradients.
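The two scalings described in the abstract can be compared with a minimal NumPy sketch. It is an illustration, not the paper's code: it assumes that "sum of the key lengths" means the sum of the Euclidean norms of all key vectors attended to by a query, and the function names, the simulated Gaussian keys and query, and the use of the Frobenius norm of the softmax Jacobian as a rough indicator of gradient size are all choices made here for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def standard_scaled_scores(q, K):
    """Standard scaling: dot products divided by sqrt(key dimension)."""
    d_k = K.shape[1]
    return K @ q / np.sqrt(d_k)


def key_length_scaled_scores(q, K):
    """Assumed alternative: dot products divided by the sum of key norms."""
    total_length = np.linalg.norm(K, axis=1).sum()
    return K @ q / total_length


def softmax_jacobian_norm(scores):
    """Frobenius norm of d softmax / d scores; near zero when softmax saturates."""
    p = softmax(scores)
    jacobian = np.diag(p) - np.outer(p, p)
    return np.linalg.norm(jacobian)


if __name__ == "__main__":
    # Simulated keys and query with deliberately large entries (hypothetical setup).
    rng = np.random.default_rng(0)
    n_keys, d_k = 8, 64
    K = rng.normal(scale=4.0, size=(n_keys, d_k))
    q = rng.normal(scale=4.0, size=d_k)

    for name, fn in [("sqrt(d_k)", standard_scaled_scores),
                     ("sum of key lengths", key_length_scaled_scores)]:
        scores = fn(q, K)
        print(name,
              "max |score| =", np.abs(scores).max(),
              "Jacobian norm =", softmax_jacobian_norm(scores))
```

Under the assumed reading, the Cauchy-Schwarz inequality bounds every scaled score in magnitude by the norm of the query, whatever the size of the keys, whereas the standard scaling offers no such bound when the entries of the keys and queries are large.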

Taxonomy

Topics: Neural Networks and Applications · Machine Learning and ELM · Face and Expression Recognition

Methods: Softmax