Attention Dispersion in Dynamic Graph Transformers: Diagnosis and a Transferable Fix
Jinhao Zhang, Kangfei Zhao, Qiuhao Zeng, Long-Kai Huang

TL;DR
This paper diagnoses attention dispersion as a shared failure mode of dynamic graph Transformers under temporal distribution shift and proposes a simple, transferable fix, differential attention, that achieves state-of-the-art results across 9 benchmarks.
Contribution
The paper identifies attention dispersion as a failure mode of Continuous-Time Dynamic Graph (CTDG) Transformers under temporal shift and introduces differential attention as an effective, transferable remedy.
Findings
Differential attention improves performance on high-shift datasets.
Differential attention reduces attention entropy, meaning the attention distributions become less dispersed.
DiffDyG achieves state-of-the-art results across 9 benchmarks.
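Attention entropy, the quantity named in the findings above, can be illustrated with a minimal sketch. It assumes the standard Shannon entropy of a softmax attention distribution; the paper's exact measurement protocol is not given on this page, and the function names and toy scores below are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    Higher entropy means attention is spread more evenly over neighbors,
    i.e. it is more dispersed.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)

# Toy scores (assumed for illustration): one neighbor stands out strongly
# in the first case and only weakly in the second.
sharp = softmax([4.0, 0.0, 0.0, 0.0])
dispersed = softmax([0.4, 0.0, 0.0, 0.0])
uniform = [0.25, 0.25, 0.25, 0.25]

print(attention_entropy(sharp))      # lower: attention is concentrated
print(attention_entropy(dispersed))  # higher: close to the uniform limit
print(attention_entropy(uniform))    # maximum for 4 neighbors, log(4)
```

Weakening the contrast between scores, as in the second case, pushes the distribution toward uniform and the entropy toward its maximum of log(n) for n neighbors.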
Abstract
Transformer-based architectures have become the dominant paradigm for Continuous-Time Dynamic Graph (CTDG) learning, yet their performance remains limited on temporally shifted datasets. In this work, we identify attention dispersion as a shared failure mode of dynamic graph Transformers under temporal distribution shift. Through a controlled ablation that contrasts structurally and temporally distinguished historical neighbors with random ones, we show that prediction depends on a class of critical nodes that carry consistently more predictive signal than arbitrary neighbors. However, existing Transformers fail to focus on these nodes even when they are present in the input, because temporal shift weakens attention contrast and produces overly dispersed attention distributions. This diagnosis suggests a simple and transferable fix: replace standard attention with differential attention, which…
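The abstract is truncated before it describes differential attention, so the following is only a sketch of the general idea as formulated in the Differential Transformer literature: the attention weights are the difference of two softmax maps, with the second scaled by a factor lambda. Whether this paper uses exactly this form is an assumption; the function names, the fixed `lam` value, and the single-query setup are hypothetical simplifications.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def differential_attention(q1, q2, keys1, keys2, values, lam=0.5):
    """Single-query differential attention (assumed formulation).

    weights = softmax(q1 . K1 / sqrt(d)) - lam * softmax(q2 . K2 / sqrt(d))
    The subtraction cancels attention mass that both maps assign in common,
    which sharpens the contrast between neighbors.
    """
    scale = 1.0 / math.sqrt(len(q1))
    a1 = softmax([dot(q1, k) * scale for k in keys1])
    a2 = softmax([dot(q2, k) * scale for k in keys2])
    weights = [x - lam * y for x, y in zip(a1, a2)]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Toy example (assumed data): three historical neighbors, the first one critical.
keys = [[2.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [0.0], [0.0]]
q1 = [1.0, 0.0]   # first query aligns with the critical neighbor
q2 = [0.0, 0.0]   # second query yields a uniform map, acting as a noise floor

weights, output = differential_attention(q1, q2, keys, keys, values, lam=0.5)
print(weights)
print(output)
```

In this toy case the second map is uniform, so subtracting it removes most of the weight on the two uninformative neighbors while leaving most of the weight on the critical one. In the Differential Transformer formulation lambda is learned rather than fixed, and the weights need not sum to one.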
