Wavy Transformer
Satoshi Noguchi, Yoshinobu Kawahara

TL;DR
The paper introduces Wavy Transformer, an architecture motivated by interpreting stacked attention layers as physical diffusion dynamics. It uses an attention layer based on second-order wavy dynamics to mitigate over-smoothing in deep transformers and improve performance across NLP and CV tasks.
Contribution
It proposes a new attention layer based on second-order wavy dynamics, together with a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship, extending the transformer architecture.
Findings
Wavy Transformer reduces over-smoothing in deep models.
It improves performance on NLP and CV tasks.
It requires minimal additional parameters and no extra hyperparameter tuning.
Abstract
Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule,…
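The contrast the abstract draws between dissipative diffusion and second-order dynamics can be illustrated with a toy simulation. The sketch below is not the paper's layer: it assumes a fixed uniform attention matrix on a complete graph (every token attends equally to every token), a hypothetical step size `tau`, and a symplectic Euler discretization chosen only for illustration. All function names are made up for this example. It compares how the spread of token representations evolves under a first-order (diffusion-like) update and under a second-order (wave-like) update that carries an explicit velocity.

```python
# Toy illustration only: uniform attention, step size tau, and the
# symplectic Euler scheme are assumptions, not the paper's formulation.

def token_mean(X):
    """Uniform attention on a complete graph: each token sees the mean of all tokens."""
    n = len(X)
    return [sum(col) / n for col in zip(*X)]

def spread(X):
    """Sum of squared deviations from the token mean (0 means fully over-smoothed)."""
    m = token_mean(X)
    return sum((x - mu) ** 2 for row in X for x, mu in zip(row, m))

def diffusion_step(X, tau):
    """First-order update: X <- X + tau * (A X - X), with A the uniform attention."""
    m = token_mean(X)
    return [[x + tau * (mu - x) for x, mu in zip(row, m)] for row in X]

def wave_step(X, V, tau):
    """Second-order update: velocity first, then state (symplectic Euler)."""
    m = token_mean(X)
    V = [[v + tau * (mu - x) for x, v, mu in zip(rx, rv, m)] for rx, rv in zip(X, V)]
    X = [[x + tau * v for x, v in zip(rx, rv)] for rx, rv in zip(X, V)]
    return X, V

def run(depth=50, tau=0.5):
    """Return (initial spread, diffusion spread after `depth` steps,
    peak wave spread over the last 10 steps)."""
    X0 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [2.0, 0.5]]
    Xd = [row[:] for row in X0]
    Xw = [row[:] for row in X0]
    Vw = [[0.0] * len(row) for row in X0]  # tokens start at rest
    wave_peak = 0.0
    for layer in range(depth):
        Xd = diffusion_step(Xd, tau)
        Xw, Vw = wave_step(Xw, Vw, tau)
        # The wave spread oscillates, so track its peak near the end.
        if layer >= depth - 10:
            wave_peak = max(wave_peak, spread(Xw))
    return spread(X0), spread(Xd), wave_peak

if __name__ == "__main__":
    s0, s_diff, s_wave = run()
    print("initial spread:", s0)
    print("diffusion spread after 50 steps:", s_diff)
    print("peak wave spread over last 10 steps:", s_wave)
```

In this toy setting the first-order update shrinks every deviation from the mean by a factor of `1 - tau` per step, so the tokens collapse toward a single point, while the second-order update makes the deviations oscillate with roughly constant amplitude. Learned, input-dependent attention behaves less cleanly than this uniform case, so the sketch should be read as intuition for the dissipative versus oscillatory distinction rather than as evidence about the actual model.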
Taxonomy
Topics: Physics and Engineering Research Articles
