Transformer Reconstructed with Dynamic Value Attention

Xiaowei Wang

arXiv:2512.22212·cs.LG·December 30, 2025

TL;DR

This paper introduces a novel transformer variant with Dynamic Value Attention that uses a single head to dynamically select values for each query, reducing complexity and training time while enhancing learning capacity.

Contribution

It proposes a single-head Dynamic Value Attention mechanism that replaces multi-head attention, simplifying the transformer architecture and improving efficiency.

Findings

01. DVA may save 37.6% of training time compared with the original transformer.

02. DVA maintains or improves learning capability.

03. DVA reduces model complexity by eliminating redundant heads.

Abstract

Since the transformer was first published in 2017, several works have been proposed to optimize it. However, the core structure of the transformer remains unchanged, leaving one of its main intrinsic limitations unaddressed: the same static value is used for every query in a head. The transformer itself tries to solve this problem with multi-head attention, yet the number of heads is limited by complexity. I propose a method that decides a value for each query dynamically, which makes it possible to cut all the redundant heads and keep only one. Consequently, the feed-forward network that follows can be removed entirely, as each revised embedding has already fetched enough useful values far beyond the context. As a result, a single-head Dynamic Value Attention (DVA) is all you need in a transformer. According to the experiment, DVA may save 37.6% of training time compared with the original transformer…
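
To make the "static value" limitation concrete, here is a minimal sketch of standard single-head scaled dot-product attention in plain Python. It shows the baseline the abstract criticizes, not the DVA method, whose details are not given in this summary. The function names (`project_rows`, `softmax`, `single_head_attention`) and the toy matrices are illustrative assumptions. The point to notice is that `V` is computed once from the inputs and reused unchanged for every query; only the mixing weights differ per query.

```python
import math


def project_rows(X, W):
    """Multiply each row of X (n x d_in) by the matrix W (d_in x d_out)."""
    return [
        [sum(x[k] * W[k][j] for k in range(len(W))) for j in range(len(W[0]))]
        for x in X
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(X, W_q, W_k, W_v):
    """Standard single-head attention (baseline, not DVA)."""
    Q = project_rows(X, W_q)
    K = project_rows(X, W_k)
    V = project_rows(X, W_v)  # computed once; the same values serve every query
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Scaled dot-product score of this query against every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        a = softmax(scores)
        weights.append(a)
        # Each output is a weighted average of the shared value rows.
        outputs.append(
            [sum(a[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights, V


# Toy inputs (hypothetical): three tokens with two features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]
W_v = [[2.0, 0.0], [0.0, 3.0]]

outputs, weights, V = single_head_attention(X, W_q, W_k, W_v)
```

In this baseline, a query can only change how the fixed rows of `V` are mixed. Multi-head attention works around that by running several such heads with different `W_v` matrices, which is the redundancy the abstract proposes to remove.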

Taxonomy

Topics: Advanced Graph Neural Networks · Advanced Neural Network Applications · Big Data and Digital Economy