PermuteFormer: Efficient Relative Position Encoding for Long Sequences

Peng Chen

arXiv:2109.02377·cs.CL·September 9, 2021

TL;DR

PermuteFormer is a Performer-based Transformer variant that adds relative position encoding while keeping linear scaling in sequence length, improving long-sequence performance with negligible computational overhead.

Contribution

The paper proposes PermuteFormer, a method for incorporating relative position encoding into Performer so that long sequences can still be modeled in linear time.

Findings

01

PermuteFormer outperforms the vanilla Transformer on most tasks across Long-Range Arena and WikiText-103.

02

It maintains linear complexity with negligible computational overhead.

03

Adding relative position encoding lets PermuteFormer improve on the performance of Performer.

Abstract

A recent variation of Transformer, Performer, scales Transformer to longer sequences with a linear attention mechanism. However, it is not compatible with relative position encoding, which has advantages over absolute position encoding. In this paper, we discuss possible ways to add relative position encoding to Performer. Based on the analysis, we propose PermuteFormer, a Performer-based model with relative position encoding that scales linearly on long sequences. PermuteFormer applies a position-dependent transformation to queries and keys to encode positional information into the attention module. This transformation is carefully crafted so that the final output of self-attention is not affected by the absolute positions of tokens. By design, PermuteFormer introduces negligible computational overhead, so it runs as fast as Performer. We evaluate PermuteFormer on Long-Range Arena, a dataset…
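The position-dependent transformation described in the abstract can be illustrated with a small NumPy sketch. It assumes the transformation is a fixed permutation of feature dimensions applied i times at position i, which is one reading of the model's name and is not spelled out in the abstract; the function name `permute_by_position` and all sizes are hypothetical, and the random non-negative vectors merely stand in for Performer's query and key feature maps. Because a permutation preserves dot products, the score between positions i and j depends only on the offset j - i.

```python
import numpy as np


def permute_by_position(features, perm):
    """Apply the permutation `perm` i times to the feature vector at position i.

    Hypothetical helper: a sketch of a position-dependent transformation,
    not the reference implementation.
    """
    seq_len, dim = features.shape
    out = np.empty_like(features)
    idx = np.arange(dim)  # identity, i.e. perm applied zero times
    for i in range(seq_len):
        out[i] = features[i, idx]
        idx = idx[perm]  # compose with perm once more for the next position
    return out


rng = np.random.default_rng(0)
dim, seq_len = 8, 6
perm = rng.permutation(dim)

# Use the same query and key content at every position, so any variation
# in the scores can only come from the positional transformation.
q = np.tile(rng.random(dim), (seq_len, 1))
k = np.tile(rng.random(dim), (seq_len, 1))

scores = permute_by_position(q, perm) @ permute_by_position(k, perm).T

# Shifting both positions by one leaves every score unchanged, so the
# score matrix is constant along each diagonal (relative, not absolute).
depends_only_on_offset = np.allclose(scores[:-1, :-1], scores[1:, 1:])
print(depends_only_on_offset)
```

Since the transformation acts on queries and keys separately, it can be applied before the linear attention products are formed, which is consistent with the abstract's claim of negligible overhead. The sketch materializes the full score matrix only to make the offset property visible.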

Code & Models

Repositories

cpcp1998/permuteformer (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Fast Attention Via Positive Orthogonal Random Features · Performer · PermuteFormer · Byte Pair Encoding · Layer Normalization