PermuteFormer: Efficient Relative Position Encoding for Long Sequences
Peng Chen

TL;DR
PermuteFormer is a Performer-based Transformer variant that adds relative position encoding while still scaling linearly with sequence length, improving performance on long sequences with negligible computational overhead.
Contribution
The paper proposes PermuteFormer, a method for incorporating relative position encoding into Performer, enabling efficient modeling of long sequences.
Findings
PermuteFormer outperforms vanilla Transformer on Long-Range Arena and WikiText-103.
It maintains linear complexity with negligible computational overhead.
By adding relative position encoding, PermuteFormer improves on the performance of Performer.
Abstract
A recent variation of Transformer, Performer, scales Transformer to longer sequences with a linear attention mechanism. However, it is not compatible with relative position encoding, which has advantages over absolute position encoding. In this paper, we discuss possible ways to add relative position encoding to Performer. Based on the analysis, we propose PermuteFormer, a Performer-based model with relative position encoding that scales linearly on long sequences. PermuteFormer applies a position-dependent transformation to queries and keys to encode positional information into the attention module. This transformation is carefully crafted so that the final output of self-attention is not affected by the absolute positions of tokens. By design, PermuteFormer introduces negligible computational overhead, so it runs as fast as Performer. We evaluate PermuteFormer on Long-Range Arena, a dataset…
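The property described in the abstract can be illustrated with a small sketch. It assumes the position-dependent transformation is a fixed permutation of feature dimensions applied once per position index, which the model's name suggests but the abstract does not spell out. The helper names (`apply_permutation_power`, `attention_scores`), the cyclic-shift permutation, and the random stand-in features are all hypothetical, and the sketch leaves out Performer's random feature map and any other components of the actual model. It materializes the full score matrix only to check the property; a linear attention model would never build it.

```python
import random


def apply_permutation_power(vec, perm, power):
    """Reorder the feature dimensions of `vec` with `perm`, `power` times."""
    out = list(vec)
    for _ in range(power):
        out = [out[perm[d]] for d in range(len(out))]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_scores(q_feats, k_feats, perm, offset=0):
    """Unnormalized scores after transforming the features at position i
    with the (i + offset)-th power of the permutation."""
    n = len(q_feats)
    q_pos = [apply_permutation_power(q_feats[i], perm, i + offset) for i in range(n)]
    k_pos = [apply_permutation_power(k_feats[j], perm, j + offset) for j in range(n)]
    return [[dot(q_pos[i], k_pos[j]) for j in range(n)] for i in range(n)]


random.seed(0)
n, m = 4, 6                      # sequence length, feature dimension
perm = [1, 2, 3, 4, 5, 0]        # assumed permutation: a cyclic shift
# Non-negative stand-ins for query/key features (not real Performer features).
q_feats = [[random.random() for _ in range(m)] for _ in range(n)]
k_feats = [[random.random() for _ in range(m)] for _ in range(n)]

scores = attention_scores(q_feats, k_feats, perm)
shifted = attention_scores(q_feats, k_feats, perm, offset=3)

# Shifting every absolute position by the same amount leaves the scores
# unchanged (up to floating-point rounding): only i - j matters.
max_diff = max(
    abs(scores[i][j] - shifted[i][j]) for i in range(n) for j in range(n)
)
print("max difference after shifting all positions:", max_diff)
```

Because the same permutation is applied to both the query and the key, their dot product depends only on how many more times one of them was permuted than the other, that is, on the relative offset between the two positions. Reordering feature dimensions is also cheap, which is consistent with the abstract's claim of negligible overhead.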
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Fast Attention Via Positive Orthogonal Random Features · Performer · PermuteFormer · Byte Pair Encoding · Layer Normalization
