TL;DR
This paper analyzes translation invariance in the position embeddings of transformer language models, proposes translation-invariant self-attention (TISA), and shows that TISA improves on ALBERT on GLUE tasks while adding far fewer positional parameters.
Contribution
It introduces translation-invariant self-attention (TISA), a novel approach that eliminates the need for traditional position embeddings in transformers.
Findings
Translation invariance increases during training and correlates with better performance.
TISA improves ALBERT on GLUE tasks.
TISA adds orders of magnitude fewer positional parameters than conventional position embeddings.
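One way to make "degree of translation invariance" concrete: a matrix is translation invariant when each entry depends only on the offset j - i, i.e. when it is a Toeplitz matrix. The sketch below scores how close a square matrix is to Toeplitz by averaging along each diagonal and computing an R^2-style fit. The function names and the exact score are illustrative assumptions, not necessarily the metric used in the paper.

```python
import numpy as np

def toeplitz_projection(m):
    """Replace every diagonal of a square matrix by its mean.

    This is the least-squares (Frobenius-norm) Toeplitz fit to m.
    """
    m = np.asarray(m, dtype=float)
    n = m.shape[0]
    idx = np.arange(n)
    offsets = idx[None, :] - idx[:, None]  # offsets[i, j] = j - i
    out = np.empty_like(m)
    for k in range(-(n - 1), n):
        mask = offsets == k
        out[mask] = m[mask].mean()
    return out

def toeplitz_r2(m):
    """Hypothetical invariance score: 1.0 means exactly Toeplitz.

    Assumes m is not constant (otherwise the denominator is zero).
    """
    m = np.asarray(m, dtype=float)
    residual = m - toeplitz_projection(m)
    total = m - m.mean()
    return 1.0 - (residual ** 2).sum() / (total ** 2).sum()
```

Applied to, for example, the Gram matrix of a model's position embeddings at successive checkpoints, a score of this kind would rise toward 1 if translation invariance increases during training.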
Abstract
Mechanisms for encoding positional information are central to transformer-based language models. In this paper, we analyze the position embeddings of existing language models, finding strong evidence of translation invariance, both for the embeddings themselves and for their effect on self-attention. The degree of translation invariance increases during training and correlates positively with model performance. Our findings lead us to propose translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion without needing conventional position embeddings. Our proposal has several theoretical advantages over existing position-representation approaches. Experiments show that it improves on regular ALBERT on GLUE tasks while adding orders of magnitude fewer positional parameters.
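The core idea in the abstract, attention that depends on the relative position between tokens rather than on absolute position embeddings, can be sketched as a single attention head whose logits receive an additive term that is a function of j - i only. The Gaussian-shaped kernels, the parameter names (`amp`, `sharp`, `center`), and the single-head NumPy layout below are assumptions made for illustration; the abstract does not specify the functional form.

```python
import numpy as np

def positional_scores(n, amp, sharp, center):
    """Return an (n, n) matrix whose entry [i, j] depends only on j - i.

    Assumed form: a sum of Gaussian-shaped kernels over the offset,
    with three scalars (amp, sharp, center) per kernel.
    """
    idx = np.arange(n)
    rel = (idx[None, :] - idx[:, None]).astype(float)  # rel[i, j] = j - i
    kernels = amp[:, None, None] * np.exp(
        -np.abs(sharp)[:, None, None] * (rel[None] - center[:, None, None]) ** 2
    )
    return kernels.sum(axis=0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tisa_attention(x, wq, wk, wv, amp, sharp, center):
    """Single-head attention with no position embeddings added to x.

    Position enters only through the offset-dependent bias on the logits.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits + positional_scores(x.shape[0], amp, sharp, center)
    return softmax(logits) @ v
```

In this sketch the positional parameter count is three scalars per kernel, independent of sequence length and hidden size, whereas a learned absolute position embedding table needs one vector per position. That is the sense in which such a scheme can be far cheaper in positional parameters.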
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Multi-Head Attention · Adam · LAMB · Layer Normalization · Residual Connection
