FNetAR: Mixing Tokens with Autoregressive Fourier Transforms

Tim Lou; Michael Park; Mohammad Ramezanali; Vincent Tang

arXiv:2107.10932·cs.CL·July 26, 2021·1 cites

TL;DR

FNetAR is an autoregressive model that replaces half of the self-attention layers with Fourier transforms. It reaches 25.8 perplexity on Wikitext-103, close to the Transformer-XL baseline's 24.2, which suggests potential for parameter reduction in Transformer models.

Contribution

It presents an autoregressive generalization of FNet that uses Fourier transforms in place of self-attention, demonstrating competitive performance with half as many self-attention layers.

Findings

01

FNetAR achieves 25.8 perplexity on Wikitext-103.

02

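This is close to, but does not beat, the Transformer-XL baseline at 24.2 perplexity (lower is better), while using only half the number of self-attention layers.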

03

The method suggests potential for reducing parameters in Transformer models.
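
To put the two perplexity figures in context, here is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities, and of the relative gap between the reported numbers. The function names `perplexity` and `relative_gap` are hypothetical helpers introduced for illustration; they do not come from the paper or its repository.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative natural-log probability per token)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_gap(ppl, baseline_ppl):
    """Fractional increase of ppl over baseline_ppl (positive means worse)."""
    return (ppl - baseline_ppl) / baseline_ppl


# Figures reported on this page: FNetAR 25.8 ppl, Transformer-XL 24.2 ppl.
gap = relative_gap(25.8, 24.2)
print(f"FNetAR perplexity is {gap:.1%} above the baseline")
```

Under these figures, FNetAR's perplexity is about 6.6% higher than the baseline's.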

Abstract

In this note we examine the autoregressive generalization of the FNet algorithm, in which self-attention layers from the standard Transformer architecture are substituted with a trivial sparse-uniform sampling procedure based on Fourier transforms. Using the Wikitext-103 benchmark, we demonstrate that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling compared to a Transformer-XL baseline (24.2 ppl) with only half the number of self-attention layers, thus providing further evidence for the superfluity of deep neural networks with heavily compounded attention mechanisms. The autoregressive Fourier transform could likely be used for parameter reduction on most Transformer-based time-series prediction models.
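
The abstract does not spell out how the Fourier transform is made autoregressive. As a rough illustration only, the sketch below mixes tokens along the sequence axis with a discrete Fourier transform whose sum is truncated at the current position, so that no output reads a later token. The truncation rule, the function name `causal_fourier_mix`, and the choice to mix along the sequence axis only are assumptions made for this example and may differ from the paper's actual procedure.

```python
import cmath
import math


def causal_fourier_mix(x):
    """Mix tokens with a causally masked DFT (illustrative assumption).

    x is a list of N token vectors, each a list of floats. Output position n
    is the real part of sum_{k <= n} exp(-2*pi*i*n*k/N) * x[k], so it never
    depends on tokens to the right of n.
    """
    n_tokens = len(x)
    dim = len(x[0])
    out = []
    for n in range(n_tokens):
        row = [0.0] * dim
        for k in range(n + 1):  # causal mask: only positions k <= n contribute
            w = cmath.exp(-2j * math.pi * n * k / n_tokens)
            for d in range(dim):
                row[d] += (w * x[k][d]).real
        out.append(row)
    return out


# Changing the last token must leave all earlier outputs untouched.
tokens = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
edited = tokens[:3] + [[5.0, -4.0]]
mixed = causal_fourier_mix(tokens)
mixed_edited = causal_fourier_mix(edited)
print(mixed[:3] == mixed_edited[:3])
```

Like the Fourier sublayer in FNet, this mixing step has no learned parameters, which is where the parameter savings relative to self-attention would come from. The explicit double loop costs O(N^2) per feature and is written for readability rather than speed.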

Code & Models

Repositories

MindCode-4/code-3/tree/main/fnet
mindspore

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Adaptive Input Representations · Linear Warmup With Cosine Annealing · Adaptive Softmax · Variational Dropout