Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method
Yifan Chen, Qi Zeng, Heng Ji, Yun Yang

TL;DR
Skyformer approximates self-attention with a Gaussian kernel and the Nyström method, reducing computational cost while matching or improving on the performance of full self-attention on long-range tasks.
Contribution
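The paper proposes Skyformer, a self-attention model that replaces the softmax structure with a Gaussian kernel to stabilize training and adapts the Nyström method to a non-positive semidefinite matrix to speed up computation. The approach is supported by a theoretical analysis of the approximation error and by experiments.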
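To make the softmax replacement concrete, here is a minimal NumPy sketch of attention scores computed with a Gaussian kernel instead of softmax. The function names, the kernel bandwidth (exp(-||q - k||^2 / 2)), and the d^(1/4) input scaling are assumptions made for illustration; this summary does not spell them out, and the block is not the authors' implementation.

```python
import numpy as np

def gaussian_kernel_scores(Q, K):
    """Pairwise Gaussian kernel exp(-||q_i - k_j||^2 / 2) between rows of Q and K."""
    sq_q = np.sum(Q ** 2, axis=1)[:, None]   # (n, 1) squared norms of queries
    sq_k = np.sum(K ** 2, axis=1)[None, :]   # (1, m) squared norms of keys
    # Clamp at zero to guard against tiny negative values from round-off.
    sq_dist = np.maximum(sq_q + sq_k - 2.0 * Q @ K.T, 0.0)
    return np.exp(-0.5 * sq_dist)

def kernel_attention(Q, K, V):
    """Attention output with Gaussian kernel scores in place of softmax.

    The d**0.25 scaling is an assumed analogue of the usual 1/sqrt(d)
    scaling of dot-product attention.
    """
    scale = Q.shape[1] ** 0.25
    return gaussian_kernel_scores(Q / scale, K / scale) @ V
```

Unlike softmax, the kernel scores need no row-wise normalization: every entry already lies in (0, 1], which is one plausible reading of why the kernel form is described as stabilizing training. Computed this way the cost is still quadratic in sequence length.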
Findings
Achieves comparable or better performance than full self-attention.
Reduces computational resources needed for long-range tasks.
Shows theoretically that the matrix approximation error of the method is small in the spectral norm.
Abstract
Transformers are expensive to train due to the quadratic time and space complexity in the self-attention mechanism. On the other hand, although kernel machines suffer from the same computation bottleneck in pairwise dot products, several approximation schemes have been successfully incorporated to considerably reduce their computational cost without sacrificing too much accuracy. In this work, we leverage the computation methods for kernel machines to alleviate the high computational cost and introduce Skyformer, which replaces the softmax structure with a Gaussian kernel to stabilize the model training and adapts the Nyström method to a non-positive semidefinite matrix to accelerate the computation. We further conduct theoretical analysis by showing that the matrix approximation error of our proposed method is small in the spectral norm. Experiments on Long Range Arena benchmark show…
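The Nyström step described in the abstract can be sketched as follows. The query-key kernel matrix is generally not symmetric, so it is not positive semidefinite. One way to still apply Nyström, assumed here for illustration, is to treat it as an off-diagonal block of the symmetric kernel matrix over the stacked queries and keys, and to draw landmark rows from that stack. Uniform landmark sampling, the pseudo-inverse, and all names below are choices made for this sketch rather than details confirmed by the summary; inputs are assumed to be already scaled.

```python
import numpy as np

def gaussian_kernel(X, Y):
    """Pairwise Gaussian kernel exp(-||x_i - y_j||^2 / 2) between rows of X and Y."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0))

def nystrom_kernel_attention(Q, K, V, num_landmarks, rng=None):
    """Approximate gaussian_kernel(Q, K) @ V without forming the n x n matrix."""
    rng = np.random.default_rng() if rng is None else rng
    Z = np.concatenate([Q, K], axis=0)            # stacked points, shape (2n, d)
    idx = rng.choice(Z.shape[0], size=num_landmarks, replace=False)
    L = Z[idx]                                    # landmarks, shape (m, d)
    left = gaussian_kernel(Q, L)                  # (n, m)
    core = np.linalg.pinv(gaussian_kernel(L, L))  # (m, m), symmetric PSD block
    right = gaussian_kernel(L, K)                 # (m, n)
    # Multiply right-to-left so no n x n intermediate is ever built.
    return left @ (core @ (right @ V))
```

With m landmarks, sequence length n, and feature dimension d, the dominant costs in this sketch are O(n m d) for the kernel blocks and O(m^3) for the pseudo-inverse, so it scales linearly in n when m is held fixed. When every stacked row is used as a landmark, the approximation reproduces the exact kernel attention up to floating-point error.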
Taxonomy
Topics: Human Pose and Action Recognition · Neural Networks and Applications · Music and Audio Processing
Methods: Softmax
