Transformer with Fourier Integral Attentions
Tan Nguyen, Minh Pham, Tam Nguyen, Khai Nguyen, Stanley J. Osher, Nhat Ho

TL;DR
This paper introduces FourierFormer, a novel transformer model that replaces dot-product attention with Fourier integral kernels, enabling better approximation of data distributions and improving accuracy in language and image tasks.
Contribution
The paper proposes FourierFormer, a transformer variant using Fourier integral kernels that automatically capture feature dependencies without tuning covariance matrices.
Findings
FourierFormers outperform baseline transformers in language modeling.
FourierFormers achieve higher accuracy in image classification.
They reduce redundancy between attention heads.
Abstract
Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels with the assumption that the queries follow a mixture of Gaussian distributions. There is no guarantee that this assumption is valid in practice. In response, we first interpret attention in transformers as a nonparametric kernel regression. We then propose the FourierFormer, a new class of transformers in which the dot-product kernels are replaced by the novel generalized Fourier integral kernels. Different from the dot-product kernels, where we need to choose a good covariance matrix to capture the dependency of the features of data, the generalized Fourier integral kernels…
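The kernel-regression reading of attention described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the radius parameter `R`, and the even `power` applied to each sinc factor are hypothetical choices made for the example, since the abstract does not give the exact form of the generalized Fourier integral kernel.

```python
import math


def softmax_kernel(q, k):
    """Exponentiated scaled dot product, the weight used by standard attention."""
    d = len(q)
    return math.exp(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))


def fourier_kernel(q, k, R=1.0, power=2):
    """Assumed form of a Fourier-integral-style kernel (hypothetical):
    a product over feature dimensions of sinc(R * (q_j - k_j)) ** power.
    An even power keeps every weight non-negative."""
    out = 1.0
    for a, b in zip(q, k):
        x = R * (a - b)
        sinc = 1.0 if x == 0.0 else math.sin(x) / x
        out *= sinc ** power
    return out


def kernel_attention(queries, keys, values, kernel):
    """Nonparametric kernel regression (Nadaraya-Watson form):
    each output is a kernel-weighted average of the value vectors.
    Assumes at least one key receives a non-zero weight per query."""
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = [kernel(q, k) for k in keys]
        total = sum(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) / total
             for j in range(dim)]
        )
    return outputs


# Toy inputs: one query, two keys, one-dimensional values.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [0.0]]

softmax_out = kernel_attention(queries, keys, values, softmax_kernel)
fourier_out = kernel_attention(queries, keys, values, fourier_kernel)
```

With `softmax_kernel`, `kernel_attention` gives the same weights as ordinary softmax attention. Swapping in `fourier_kernel` changes only the weighting function, which is the substitution the abstract describes.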
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning
