Graph Convolutions Enrich the Self-Attention in Transformers!
Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park

TL;DR
This paper introduces a graph-filter-based self-attention mechanism for Transformers, inspired by graph signal processing, which improves performance across multiple domains at the cost of slightly higher complexity than the original self-attention.
Contribution
It reinterprets the original self-attention as a simple graph filter and proposes graph-filter-based self-attention (GFSA), a redesign from a graph signal processing perspective that improves Transformer performance across diverse tasks.
Findings
GFSA outperforms traditional self-attention in multiple tasks
Increased complexity is justified by performance gains
Applicable across NLP, CV, speech, and graph tasks
Abstract
Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose a graph-filter-based self-attention (GFSA) to learn a general yet effective one, whose complexity, however, is slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech…
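The graph-filter reading of self-attention in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it treats the softmax attention matrix as a row-normalized adjacency matrix over tokens and builds a generic polynomial graph filter from it. The function names (`attention_matrix`, `polynomial_filter`, `token_spread`), the coefficient values, and the simplifications (no learned query/key projections, one fixed attention matrix reused at every layer) are assumptions made for illustration only.

```python
import math


def attention_matrix(x):
    """Softmax(X X^T / sqrt(d)): a simplified attention matrix.

    Assumption: no learned query/key projections, so the scores are plain
    scaled dot products between token vectors.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    rows = []
    for xi in x:
        scores = [sum(a * b for a, b in zip(xi, xj)) / scale for xj in x]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def polynomial_filter(attn, coeffs):
    """Return sum_k coeffs[k] * attn^k, a generic polynomial graph filter.

    coeffs = [0.0, 1.0] reproduces the attention matrix itself, i.e. the
    "simple graph filter" reading of the original self-attention.
    """
    n = len(attn)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [[0.0] * n for _ in range(n)]
    for c in coeffs:
        for i in range(n):
            for j in range(n):
                result[i][j] += c * power[i][j]
        power = matmul(power, attn)
    return result


def token_spread(x):
    """Largest per-feature gap between tokens; near zero means oversmoothed."""
    return max(max(col) - min(col) for col in zip(*x))


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, 2 features
    attn = attention_matrix(tokens)

    # Repeatedly applying the plain attention matrix averages tokens together.
    smoothed = tokens
    for _ in range(8):
        smoothed = matmul(attn, smoothed)
    print("spread before:", token_spread(tokens))
    print("spread after 8 plain layers:", token_spread(smoothed))

    # A filter with identity and higher-order terms (hypothetical coefficients).
    filt = polynomial_filter(attn, [0.5, 1.0, -0.5])
    print("filtered tokens:", matmul(filt, tokens))
```

Because every row of the attention matrix is a set of positive weights summing to one, each application replaces a token with a weighted average of all tokens, which is why the spread shrinks in this toy setting. Adding identity and higher-order terms is one way a filter can behave as something other than pure averaging; the form GFSA actually uses and how its coefficients are learned are specified in the paper, not here.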
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Sentiment Analysis and Opinion Mining
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Dropout · Softmax · Multi-Head Attention · Byte Pair Encoding · Adam · Absolute Position Encodings · Layer Normalization
