Transformer Reconstructed with Dynamic Value Attention
Xiaowei Wang

TL;DR
This paper introduces a novel transformer variant with Dynamic Value Attention that uses a single head to dynamically select values for each query, reducing complexity and training time while enhancing learning capacity.
Contribution
It proposes a single-head Dynamic Value Attention mechanism that replaces multi-head attention, simplifying the transformer architecture and improving efficiency.
Findings
In the reported experiment, DVA saves 37.6% of training time compared to the original transformer.
DVA maintains or improves learning capability.
DVA reduces model complexity by eliminating redundant heads.
Abstract
Since the transformer was first published in 2017, several works have proposed ways to optimize it. However, the overall structure of the transformer remains unchanged, leaving one of its main intrinsic limitations unaddressed: the same static value is used for every query in a head. The transformer itself tries to solve this problem with multi-head attention, yet the number of heads is limited by complexity. I propose a method that decides a value for each query dynamically, which makes it possible to cut all the redundant heads and keep only one. Consequently, the subsequent feed-forward network can be removed entirely, as each revised embedding has already fetched enough useful values far beyond the context. As a result, a single-head Dynamic Value Attention (DVA) is all you need in a transformer. According to the experiment, DVA may save 37.6% of training time compared with the original transformer…
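The limitation the abstract refers to can be seen in a minimal NumPy sketch of one standard scaled dot-product attention head, as used in the original transformer. This illustrates only the baseline, not DVA itself, since the abstract does not specify how DVA computes its values. The function name, matrix names, and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, W_q, W_k, W_v):
    """One standard attention head (baseline transformer, not DVA)."""
    Q = X @ W_q
    K = X @ W_k
    # V depends only on the tokens, not on the query: every query
    # in this head mixes the same fixed set of value vectors.
    V = X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights, V

# Illustrative sizes (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights, V = single_head_attention(X, W_q, W_k, W_v)
```

Here only the attention weights differ from query to query, while `V` is computed once and shared. Multi-head attention works around this by learning several `W_v` projections, at the cost of extra parameters and computation.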
Taxonomy
Topics: Advanced Graph Neural Networks · Advanced Neural Network Applications · Big Data and Digital Economy
