Choose a Transformer: Fourier or Galerkin

Shuhao Cao

arXiv:2105.14995 · cs.LG · November 2, 2021 · 58 cites

TL;DR

This paper introduces a Transformer variant, the Galerkin Transformer, which removes softmax normalization and employs a Petrov-Galerkin-inspired layer normalization, achieving efficient and accurate operator learning for PDE-related problems.

Contribution

It demonstrates that softmax normalization is unnecessary for Transformer-based operator learning and proposes a new layer normalization scheme inspired by Petrov-Galerkin methods.
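
To make the idea concrete, here is a minimal NumPy sketch of a softmax-free attention layer. It assumes the layer normalization is applied to the keys and values before they are multiplied, and that the result is scaled by the sequence length `n`. The function names, the shapes, and the omission of learnable scale and shift parameters are illustrative assumptions, not code from the official repository.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one position's features) to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def galerkin_attention(x, w_q, w_k, w_v):
    """Softmax-free attention sketch: Q (LN(K)^T LN(V)) / n."""
    n = x.shape[0]                # sequence length (number of grid points)
    q = x @ w_q                   # queries, shape (n, d)
    k = layer_norm(x @ w_k)       # normalized keys, shape (n, d)
    v = layer_norm(x @ w_v)       # normalized values, shape (n, d)
    return q @ (k.T @ v) / n      # the inner product k.T @ v is only (d, d)

rng = np.random.default_rng(0)
n, d = 64, 8                      # hypothetical sequence length and feature width
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

z = galerkin_attention(x, w_q, w_k, w_v)
print(z.shape)  # (64, 8)
```

Because no softmax is applied row-wise to an `n`-by-`n` score matrix, the keys and values can be contracted first, and dividing by `n` keeps the output scale from growing with the sequence length.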

Findings

01 Without softmax, the layer-wise approximation capacity of the attention is comparable to a Petrov-Galerkin projection, and the estimate is independent of sequence length.

02 Removing softmax reduces computational cost without sacrificing accuracy.

03 The model performs well on PDE operator learning tasks such as Burgers' equation and Darcy flow.
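
The cost claim in finding 02 follows from associativity of matrix products once softmax is gone. The sketch below is an illustration under assumed sizes (`n`, `d` are hypothetical, and normalization is left out to isolate the reordering). It checks that both orderings give the same result and compares rough multiply-add counts.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 16                    # hypothetical: n positions, d features each
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

# Without softmax the product is associative, so both orderings agree.
quadratic = (q @ k.T) @ v / n     # forms an n-by-n matrix first
linear = q @ (k.T @ v) / n        # forms a d-by-d matrix first

print(np.allclose(quadratic, linear))  # True

# Rough multiply-add counts for the two orderings.
cost_quadratic = 2 * n * n * d    # (n x d)(d x n), then (n x n)(n x d)
cost_linear = 2 * n * d * d       # (d x n)(n x d), then (n x d)(d x d)
print(cost_quadratic // cost_linear)  # 32
```

The ratio is `n / d`, so the saving grows with the sequence length. With softmax in place, the `n`-by-`n` score matrix must be formed before normalizing, which rules out this reordering.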

Abstract

In this paper, we apply the self-attention from the state-of-the-art Transformer in Attention Is All You Need for the first time to a data-driven operator learning problem related to partial differential equations. An effort is put together to explain the heuristics of, and to improve the efficacy of the attention mechanism. By employing the operator approximation theory in Hilbert spaces, it is demonstrated for the first time that the softmax normalization in the scaled dot-product attention is sufficient but not necessary. Without softmax, the approximation capacity of a linearized Transformer variant can be proved to be comparable to a Petrov-Galerkin projection layer-wise, and the estimate is independent with respect to the sequence length. A new layer normalization scheme mimicking the Petrov-Galerkin projection is proposed to allow a scaling to propagate through attention layers,…

Code & Models

Repositories

scaomath/galerkin-transformer (PyTorch, official)

Videos

Choose a Transformer: Fourier or Galerkin · SlidesLive

Taxonomy

Topics: Model Reduction and Neural Networks · Lattice Boltzmann Simulation Studies · Advanced Numerical Methods in Computational Mathematics

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization · Residual Connection · Dropout · Softmax · Adam