SigGate-GT: Taming Over-Smoothing in Graph Transformers via Sigmoid-Gated Attention
Dongxin Guo, Jikun Wu, Siu Ming Yiu

TL;DR
SigGate-GT adds learned sigmoid gates to the attention output of a graph transformer to mitigate over-smoothing and attention entropy degeneration, improving performance and training stability on molecular benchmarks.
Contribution
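The paper proposes learned, per-head sigmoid gates applied to the attention output within the GraphGPS framework. Because each gate can suppress activations toward zero, heads can selectively silence uninformative connections, which addresses both over-smoothing and attention entropy degeneration.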
Findings
Achieves state-of-the-art on ogbg-molhiv (82.47% ROC-AUC).
Reduces over-smoothing by 30% across layers.
Increases attention entropy and stabilizes training.
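The two quantities named in these findings can be made concrete with a small sketch. The summary does not say which over-smoothing metric the paper reports, so the mean pairwise cosine distance below is an assumed, commonly used proxy, and both function names are hypothetical. Attention entropy is taken to be the Shannon entropy of one row of attention weights.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one row of attention weights.

    Assumes the weights are non-negative and sum to one.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def mean_pairwise_cosine_distance(nodes):
    """Average cosine distance over all pairs of node vectors.

    Assumed proxy for over-smoothing: values near zero mean the node
    representations point in nearly the same direction (collapsed).
    Assumes at least two nodes and no all-zero vectors.
    """
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (norm_a * norm_b)

    pairs = [(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))]
    return sum(cosine_distance(nodes[i], nodes[j]) for i, j in pairs) / len(pairs)
```

Under this sketch, a head that puts all its weight on one node has zero entropy, a uniform head over n nodes has entropy log n, and node vectors that are scalar multiples of each other have a distance of approximately zero.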
Abstract
Graph transformers achieve strong results on molecular and long-range reasoning tasks, yet remain hampered by over-smoothing (the progressive collapse of node representations with depth) and attention entropy degeneration. We observe that these pathologies share a root cause with attention sinks in large language models: softmax attention's sum-to-one constraint forces every node to attend somewhere, even when no informative signal exists. Motivated by recent findings that element-wise sigmoid gating eliminates attention sinks in large language models, we propose SigGate-GT, a graph transformer that applies learned, per-head sigmoid gates to the attention output within the GraphGPS framework. Each gate can suppress activations toward zero, enabling heads to selectively silence uninformative connections. On five standard benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE)…
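The gating idea described in the abstract can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation: the function names are hypothetical, the gate logits are passed in directly rather than produced by a learned projection, and the exact gate parameterization used in SigGate-GT is not given here. The sketch only shows the core contrast: softmax weights always sum to one, so an ungated head must return a convex combination of the values, whereas an element-wise sigmoid gate on the output can push it toward zero.

```python
import math


def softmax(scores):
    """Numerically stable softmax; the result always sums to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def attention_head(queries, keys, values):
    """Plain scaled dot-product attention for one head over all nodes."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)  # forced to sum to one for every node
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


def gated_attention_head(queries, keys, values, gate_logits):
    """Attention output scaled element-wise by sigmoid(gate_logits).

    gate_logits has the same shape as the attention output. In a real
    model it would come from learned parameters (an assumption here).
    """
    attended = attention_head(queries, keys, values)
    return [[sigmoid(g) * o for g, o in zip(gate_row, out_row)]
            for gate_row, out_row in zip(gate_logits, attended)]
```

For example, if every value vector is `[1.0, 1.0]`, the ungated head returns `[1.0, 1.0]` for every node regardless of the scores, since the weights sum to one. With strongly negative gate logits for a node, the gated head returns values close to zero for that node, which is the "selective silencing" behavior the abstract describes.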
