Wavy Transformer

Satoshi Noguchi; Yoshinobu Kawahara

arXiv:2508.12787·cs.LG·October 21, 2025

TL;DR

The paper introduces Wavy Transformer, an architecture built on second-order wavy dynamics rather than the dissipative diffusion dynamics induced by stacked attention layers. It mitigates over-smoothing in deep transformers and improves performance across NLP and CV tasks.

Contribution

It proposes a new attention layer based on second-order wavy dynamics, together with a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule.

Findings

01. Wavy Transformer reduces over-smoothing in deep models.

02. It improves performance on NLP and CV tasks.

03. It requires minimal additional parameters and no extra hyperparameter tuning.

Abstract

Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule,…
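The contrast the abstract draws between dissipative diffusion and second-order wavy dynamics can be illustrated with a small numerical toy. The sketch below is not the authors' layer: it assumes a single fixed row-stochastic attention matrix `attn` on a complete graph, zero initial velocity, a step size `tau`, and a hypothetical `token_spread` measure (mean squared distance of tokens from their centroid). It compares a first-order update dx/dt = (A - I)x with a second-order update d²x/dt² = (A - I)x.

```python
import numpy as np

def row_softmax(scores):
    """Row-wise softmax, giving a row-stochastic attention matrix."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

def token_spread(x):
    """Mean squared distance of the token vectors from their centroid."""
    centered = x - x.mean(axis=0, keepdims=True)
    return float(np.mean(np.sum(centered ** 2, axis=1)))

def diffusion_spreads(x0, attn, steps, tau):
    """Explicit Euler steps of the first-order system dx/dt = (A - I) x."""
    lap = attn - np.eye(attn.shape[0])
    x = x0.copy()
    spreads = [token_spread(x)]
    for _ in range(steps):
        x = x + tau * (lap @ x)
        spreads.append(token_spread(x))
    return spreads

def wave_spreads(x0, attn, steps, tau):
    """Symplectic Euler steps of the second-order system d2x/dt2 = (A - I) x."""
    lap = attn - np.eye(attn.shape[0])
    x = x0.copy()
    v = np.zeros_like(x0)  # assumption: tokens start with zero velocity
    spreads = [token_spread(x)]
    for _ in range(steps):
        v = v + tau * (lap @ x)  # update velocity from the current state
        x = x + tau * v          # then move the state with the new velocity
        spreads.append(token_spread(x))
    return spreads

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
g = rng.normal(size=(n_tokens, n_tokens))
attn = row_softmax(0.5 * (g + g.T))  # symmetric scores keep the spectrum real
x0 = rng.normal(size=(n_tokens, dim))

diff = diffusion_spreads(x0, attn, steps=100, tau=0.5)
wave = wave_spreads(x0, attn, steps=100, tau=0.5)
print(f"initial spread:            {diff[0]:.4f}")
print(f"diffusion, final spread:   {diff[-1]:.2e}")
print(f"wave, max over last 20:    {max(wave[-20:]):.4f}")
```

In this toy the first-order update drives every token toward a common value, which is the over-smoothing behaviour described above, while the second-order update keeps the tokens oscillating instead of collapsing. A real transformer recomputes attention at every block, so this only conveys the qualitative picture.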

Videos

Wavy Transformer · SlidesLive

Taxonomy

Topics: Physics and Engineering Research Articles