The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models

Ulme Wennberg; Gustav Eje Henter

arXiv:2106.01950 · cs.CL · June 4, 2021

TL;DR

This paper analyzes translation invariance in the position embeddings of transformer-based language models, proposes translation-invariant self-attention (TISA), and shows that TISA improves on regular ALBERT on GLUE tasks while adding far fewer positional parameters.

Contribution

The paper introduces translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion without needing conventional position embeddings.

Findings

01. The degree of translation invariance increases during training and correlates positively with model performance.

02. TISA improves on regular ALBERT on GLUE tasks.

03. TISA adds orders of magnitude fewer positional parameters.

Abstract

Mechanisms for encoding positional information are central to transformer-based language models. In this paper, we analyze the position embeddings of existing language models, finding strong evidence of translation invariance, both for the embeddings themselves and for their effect on self-attention. The degree of translation invariance increases during training and correlates positively with model performance. Our findings lead us to propose translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion without needing conventional position embeddings. Our proposal has several theoretical advantages over existing position-representation approaches. Experiments show that it improves on regular ALBERT on GLUE tasks, while only adding orders of magnitude fewer positional parameters.
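To make the idea in the abstract concrete, the following is a minimal NumPy sketch of self-attention whose positional term depends only on the offset between two tokens. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the functional form of the positional term, so the sum-of-Gaussian-bumps kernel and the names `relative_position_bias`, `amplitudes`, `sharpness`, and `centers` are hypothetical choices made here for the example.

```python
import numpy as np


def relative_position_bias(seq_len, amplitudes, sharpness, centers):
    """Return an (n, n) matrix B with B[i, j] = f(j - i).

    ASSUMPTION: f is a sum of Gaussian-shaped bumps over the offset j - i.
    This kernel form is illustrative; the abstract does not specify it.
    """
    positions = np.arange(seq_len)
    offsets = positions[None, :] - positions[:, None]   # offsets[i, j] = j - i
    diff = offsets[..., None] - centers                 # shape (n, n, num_kernels)
    bumps = amplitudes * np.exp(-np.abs(sharpness) * diff ** 2)
    return bumps.sum(axis=-1)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def translation_invariant_attention(queries, keys, values, bias):
    """Scaled dot-product attention plus an offset-only positional bias.

    No position embeddings are added to the token representations; all
    positional information enters through `bias`.
    """
    depth = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(depth) + bias
    weights = softmax(logits, axis=-1)
    return weights @ values, weights


# Two hypothetical kernels: one centered on the token itself,
# one centered two positions to the left.
bias = relative_position_bias(
    seq_len=6,
    amplitudes=np.array([1.0, 0.5]),
    sharpness=np.array([0.3, 1.0]),
    centers=np.array([0.0, -2.0]),
)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
output, weights = translation_invariant_attention(q, k, v, bias)
```

Because `bias[i, j]` depends only on `j - i`, the matrix is constant along each diagonal (Toeplitz), so shifting every token by the same amount leaves the positional contribution unchanged. In this sketch each kernel costs three scalars, independent of sequence length, which is one way a positional scheme can need far fewer parameters than a learned table with one vector per position.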

Taxonomy

Methods

Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Multi-Head Attention · Adam · LAMB · Layer Normalization · Residual Connection