An alternative formulation of attention pooling function in translation
Eddie Conti

TL;DR
This paper proposes a new formulation of the attention pooling function in translation models by projecting attention scores onto a band matrix space, improving approximation and understanding of language structure.
Contribution
It introduces an alternative attention scoring function based on band matrix projections, addressing limitations of traditional attention mechanisms in translation.
Findings
The new attention formula closely approximates the original scores.
Parameter analysis reveals insights into language processing.
The approach guarantees a well-posed, unique solution for attention scores.
Abstract
The aim of this paper is to present an alternative formulation of the attention scoring function in translation tasks. Generally speaking, language is deeply structured, and this is reflected in the attention scoring matrix. We exploit this property to define the attention pooling function, taking this aspect into account. In the first chapters, we introduce the attention mechanism in mathematical terms and explain its limitations and alternative formulations. Next, we focus on the experimental session that led to the alternative formulation. Essentially, we guide queries and keys to interact in a specific manner, encoding the distinct roles of attention heads and directing values on where to seek context. In mathematical terms, we can think of this formula as projecting the attention scores matrix onto the space of band matrices with fixed bandwidth. This convex subspace is…
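The projection idea in the abstract can be illustrated with a minimal sketch. It assumes that distance is measured in the Frobenius norm, which the abstract does not state; under that assumption, the nearest band matrix is obtained by keeping the entries within the bandwidth of the main diagonal and zeroing the rest. The function names `project_to_band` and `frobenius_distance` and the toy numbers are hypothetical, and the paper's own parametrized formula may differ from this plain projection.

```python
def project_to_band(scores, bandwidth):
    """Zero every entry whose column index is more than `bandwidth` away from its row index.

    Assumption: this is the Frobenius-norm projection onto band matrices,
    not necessarily the formula used in the paper.
    """
    return [
        [value if abs(i - j) <= bandwidth else 0.0 for j, value in enumerate(row)]
        for i, row in enumerate(scores)
    ]


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) ** 0.5


# Toy 4x4 score matrix (made-up numbers, not taken from the paper).
scores = [
    [0.9, 0.4, 0.1, 0.0],
    [0.3, 0.8, 0.5, 0.2],
    [0.1, 0.2, 0.7, 0.6],
    [0.0, 0.1, 0.3, 0.9],
]

projected = project_to_band(scores, bandwidth=1)
error = frobenius_distance(scores, projected)

for row in projected:
    print(row)
print("approximation error:", error)
```

Because band matrices with a fixed bandwidth form a linear subspace, which is closed and convex, the nearest point in that set is unique; applying `project_to_band` a second time leaves the result unchanged.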
Taxonomy
Topics: Cognitive Computing and Networks · Robotics and Automated Systems
