Coupled Query-Key Dynamics for Attention
Barak Gahtan, Alex M. Bronstein

TL;DR
This paper introduces coupled query-key dynamics in attention mechanisms, which jointly evolve queries and keys to improve language modeling performance and training stability, especially on domain-coherent text.
Contribution
It proposes a novel coupled QK dynamics approach that enhances sample efficiency and training stability in attention models, with empirical validation on language modeling tasks.
Findings
Coupled dynamics reduce perplexity by 6.6-6.9% relative to standard attention on WikiText-103 at 60M parameters (22.55-22.62 vs. 24.22).
The coupled approach improves training efficiency, matching the results of longer training with fewer tokens.
Benefits are domain-dependent, helping on coherent text but not on heterogeneous web data.
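As a quick arithmetic check, the relative reductions can be recomputed from the perplexities quoted in the abstract. The helper name `relative_reduction` is introduced here for illustration only and does not come from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Perplexities as quoted in the abstract (WikiText-103, 60M parameters).
baseline_ppl = 24.22           # standard attention
coupled_ppls = (22.62, 22.55)  # the two coupled instantiations
uncoupled_mlp_ppl = 23.81      # uncoupled MLP baseline of matched capacity

coupled_reductions = [relative_reduction(baseline_ppl, p) for p in coupled_ppls]
mlp_reduction = relative_reduction(baseline_ppl, uncoupled_mlp_ppl)

print([f"{r:.1f}%" for r in coupled_reductions])  # ['6.6%', '6.9%']
print(f"{mlp_reduction:.1f}%")                    # 1.7%
```

On these figures the uncoupled baseline recovers only about a quarter of the improvement, which is consistent with the abstract's claim that coupling is the active ingredient.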
Abstract
Standard scaled dot-product attention computes scores from static, independent projections of the input. We show that evolving queries and keys jointly through shared learned dynamics before scoring, which we call coupled QK dynamics, improves language modeling perplexity and training stability. On WikiText-103 at 60M parameters, coupled dynamics achieves 22.55-22.62 perplexity vs. 24.22 for standard attention (a 6.6-6.9% reduction), with only 0.11% additional parameters (shared across both instantiations). A structural ablation isolates coupling as the active ingredient: a symplectic (Hamiltonian) and a non-symplectic (Euler) integrator perform identically when both couple Q and K, while an uncoupled MLP baseline of matched capacity reaches only 23.81 with 8× higher seed variance. The integration step count (1-7) is similarly irrelevant - a single coupled step…
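The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the form of the dynamics, so the update rule in `coupled_euler_step`, the shared matrix `w_dyn`, the `tanh` nonlinearity, and the step size are all assumptions chosen only to show Q and K being updated from each other through shared weights before scoring. Causal masking and multiple heads are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Standard scaled dot-product attention (no mask, single head).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def coupled_euler_step(q, k, w_dyn, step):
    # HYPOTHETICAL coupling: each update reads the *other* tensor through
    # the same shared weights w_dyn. One explicit Euler step.
    dq = np.tanh(k @ w_dyn)
    dk = -np.tanh(q @ w_dyn)
    return q + step * dq, k + step * dk


def coupled_attention(x, wq, wk, wv, w_dyn, n_steps=1, step=0.1):
    # Static projections first, then joint evolution of Q and K, then scoring.
    q, k, v = x @ wq, x @ wk, x @ wv
    for _ in range(n_steps):
        q, k = coupled_euler_step(q, k, w_dyn, step)
    return attention(q, k, v)


rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.standard_normal((seq_len, d_model))
wq, wk, wv, w_dyn = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

baseline = attention(x @ wq, x @ wk, x @ wv)
out = coupled_attention(x, wq, wk, wv, w_dyn, n_steps=1)
no_op = coupled_attention(x, wq, wk, wv, w_dyn, n_steps=1, step=0.0)
```

With `step=0.0` the sketch reduces exactly to standard attention, which makes the coupled update easy to isolate. In this toy version the only extra parameters are those of `w_dyn`, used by both the Q and the K update.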
