Choose a Transformer: Fourier or Galerkin
Shuhao Cao

TL;DR
This paper introduces the Galerkin Transformer, a Transformer variant that removes softmax normalization from attention and uses a layer normalization scheme inspired by the Petrov-Galerkin projection, giving efficient and accurate operator learning for PDE-related problems.
Contribution
It shows that softmax normalization in scaled dot-product attention is sufficient but not necessary for Transformer-based operator learning, and proposes a new layer normalization scheme that mimics the Petrov-Galerkin projection.
Findings
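Without softmax, the approximation capacity of the linearized attention is provably comparable, layer-wise, to a Petrov-Galerkin projection, and the estimate is independent of the sequence length.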
Removing softmax reduces computational cost without sacrificing accuracy.
The model performs well on PDE operator learning tasks such as Burgers' equation and Darcy flow.
Abstract
In this paper, we apply the self-attention from the state-of-the-art Transformer in Attention Is All You Need for the first time to a data-driven operator learning problem related to partial differential equations. An effort is put together to explain the heuristics of, and to improve the efficacy of the attention mechanism. By employing the operator approximation theory in Hilbert spaces, it is demonstrated for the first time that the softmax normalization in the scaled dot-product attention is sufficient but not necessary. Without softmax, the approximation capacity of a linearized Transformer variant can be proved to be comparable to a Petrov-Galerkin projection layer-wise, and the estimate is independent with respect to the sequence length. A new layer normalization scheme mimicking the Petrov-Galerkin projection is proposed to allow a scaling to propagate through attention layers,…
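The softmax-free attention described in the abstract can be illustrated with a minimal NumPy sketch. It rests on assumptions that go beyond this summary: a single head, no learned projections, a layer normalization without learnable scale and shift, normalization applied to the keys and values, and a division by the sequence length n. The function names `layer_norm` and `galerkin_type_attention` are hypothetical and are not taken from the author's code.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row (token) to zero mean and unit variance.

    Simplification: no learnable scale or shift parameters.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def galerkin_type_attention(q, k, v):
    """Softmax-free attention, assumed form: Q (LN(K)^T LN(V)) / n.

    q, k, v have shape (n, d). Forming the (d, d) product K^T V first
    costs O(n d^2) instead of the O(n^2 d) of a softmax attention map.
    """
    n = q.shape[0]
    kv = layer_norm(k).T @ layer_norm(v)  # shape (d, d)
    return q @ kv / n                     # shape (n, d)


# Illustrative data only: random queries, keys, and values.
rng = np.random.default_rng(0)
n, d = 64, 8
q, k, v = rng.standard_normal((3, n, d))

out = galerkin_type_attention(q, k, v)

# The same result computed through the (n, n) matrix, for comparison.
quadratic = (q @ layer_norm(k).T) @ layer_norm(v) / n
```

Because no softmax sits between the two matrix products, they can be reordered freely, so `out` and `quadratic` agree up to floating-point error while only the first ordering avoids the n-by-n intermediate.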
Taxonomy
Topics: Model Reduction and Neural Networks · Lattice Boltzmann Simulation Studies · Advanced Numerical Methods in Computational Mathematics
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Residual Connection · Dropout · Softmax · Adam
