FNetAR: Mixing Tokens with Autoregressive Fourier Transforms
Tim Lou, Michael Park, Mohammad Ramezanali, Vincent Tang

TL;DR
FNetAR is an autoregressive model that replaces half of the Transformer's self-attention layers with Fourier-transform token mixing. On Wikitext-103 language modeling it stays close to a Transformer-XL baseline (25.8 vs. 24.2 perplexity), which suggests potential for parameter reduction in Transformer models.
Contribution
It presents an autoregressive generalization of FNet that uses Fourier transforms in place of self-attention, and demonstrates competitive performance with only half the number of self-attention layers.
Findings
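FNetAR achieves 25.8 perplexity on Wikitext-103.
This is close to, but slightly worse than, the Transformer-XL baseline (24.2 perplexity; lower is better), while using only half the number of self-attention layers.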
The method suggests potential for reducing parameters in Transformer models.
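Perplexity, the metric quoted above, is the exponential of the average negative log-likelihood per token, so a lower value is better. The sketch below shows that computation on made-up numbers. The function name `perplexity` and the log-probabilities are illustrative assumptions, not taken from the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical log-probabilities for a four-token sequence (illustrative only).
log_probs = [math.log(0.25), math.log(0.5), math.log(0.125), math.log(0.25)]
ppl = perplexity(log_probs)  # geometric-mean probability is 0.25, so ppl is about 4

# Relative gap between the two reported Wikitext-103 perplexities.
relative_gap = (25.8 - 24.2) / 24.2  # roughly 0.066, i.e. about 6.6% higher
```

On this scale, the reported FNetAR perplexity is roughly 6.6% above the Transformer-XL baseline.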
Abstract
In this note we examine the autoregressive generalization of the FNet algorithm, in which self-attention layers from the standard Transformer architecture are substituted with a trivial sparse-uniform sampling procedure based on Fourier transforms. Using the Wikitext-103 benchmark, we demonstrate that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling compared to a Transformer-XL baseline (24.2 ppl) with only half the number of self-attention layers, thus providing further evidence for the superfluity of deep neural networks with heavily compounded attention mechanisms. The autoregressive Fourier transform could likely be used for parameter reduction on most Transformer-based time-series prediction models.
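As a rough illustration of what autoregressive Fourier mixing might look like, the sketch below applies FNet-style mixing (a DFT along the hidden dimension, then a DFT along the sequence dimension, keeping the real part) with a causal restriction on the sequence transform. The causal rule used here, letting position n sum only over positions m ≤ n, is an assumption made for illustration and may differ from the authors' exact formulation. The names `dft_matrix` and `causal_fourier_mix` are hypothetical.

```python
import cmath
import math

def dft_matrix(size):
    """Return the size x size DFT matrix W[a][b] = exp(-2*pi*i*a*b/size)."""
    return [[cmath.exp(-2j * math.pi * a * b / size) for b in range(size)]
            for a in range(size)]

def causal_fourier_mix(x):
    """Mix a (seq_len x hidden) list of lists with a causally masked DFT.

    Assumption: causality is enforced by a lower-triangular mask on the
    sequence DFT matrix, so output position n depends only on inputs m <= n.
    """
    seq_len, hidden = len(x), len(x[0])
    w_seq = dft_matrix(seq_len)
    w_hid = dft_matrix(hidden)
    # DFT along the hidden dimension (no masking needed within a token).
    h = [[sum(w_hid[k][j] * x[m][j] for j in range(hidden))
          for k in range(hidden)]
         for m in range(seq_len)]
    # Masked DFT along the sequence dimension; keep the real part as in FNet.
    return [[sum(w_seq[n][m] * h[m][k] for m in range(n + 1)).real
             for k in range(hidden)]
            for n in range(seq_len)]

# Toy input: 3 tokens with hidden size 2 (values are arbitrary).
x = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
y = causal_fourier_mix(x)

# Changing only the last token must leave earlier outputs untouched.
x_future = [row[:] for row in x]
x_future[2] = [100.0, -100.0]
y_future = causal_fourier_mix(x_future)
```

The mixing step has no learned parameters, which is consistent with the abstract's point about parameter reduction. This pure-Python version costs O(N²) per feature and is meant only to make the causality property easy to check, not to be efficient.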
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Adaptive Input Representations · Linear Warmup With Cosine Annealing · Adaptive Softmax · Variational Dropout
