Using the Output Embedding to Improve Language Models

Ofir Press; Lior Wolf

arXiv:1608.05859 · cs.CL · February 22, 2017


TL;DR

This paper shows that tying the input and output embeddings of neural language models lowers perplexity and reduces model size, and it proposes a new way to regularize the output embedding.

Contribution

The paper shows that the output embedding (the topmost weight matrix of a neural language model) is a valid word embedding, and it recommends tying the input and output embeddings during training.

Findings

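01. In a tied model, the shared embedding evolves more like the untied model's output embedding than like its input embedding.

02. A new method of regularizing the output embedding reduces perplexity.

03. Weight tying shrinks neural translation models to less than half of their original size without harming performance.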
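The size saving from tying can be checked with simple arithmetic. The sketch below counts embedding parameters with and without tying. The function name `embedding_params` and every size in it are hypothetical values chosen for illustration; none of them are taken from the paper, and the actual saving depends on how much of a given model consists of embedding matrices.

```python
def embedding_params(vocab_size: int, embed_dim: int, tied: bool) -> int:
    """Count parameters in the input and output embedding matrices.

    Each matrix is assumed to have shape (vocab_size, embed_dim); with tying,
    a single matrix serves both roles.
    """
    per_matrix = vocab_size * embed_dim
    return per_matrix if tied else 2 * per_matrix


# Hypothetical sizes, chosen only for illustration (not from the paper).
VOCAB_SIZE = 10_000
EMBED_DIM = 200
REST_OF_MODEL = 500_000  # assumed count of all non-embedding parameters

untied_total = REST_OF_MODEL + embedding_params(VOCAB_SIZE, EMBED_DIM, tied=False)
tied_total = REST_OF_MODEL + embedding_params(VOCAB_SIZE, EMBED_DIM, tied=True)

# Fraction of the untied model's parameters removed by tying.
saved_fraction = 1 - tied_total / untied_total
```

With these assumed sizes the embeddings dominate the parameter count, so removing one of the two matrices removes a large share of the model. A model with a smaller vocabulary or larger non-embedding layers would see a smaller relative saving.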

Abstract

We study the topmost weight matrix of neural network language models. We show that this matrix constitutes a valid word embedding. When training language models, we recommend tying the input embedding and this output embedding. We analyze the resulting update rules and show that the tied embedding evolves in a more similar way to the output embedding than to the input embedding in the untied model. We also offer a new method of regularizing the output embedding. Our methods lead to a significant reduction in perplexity, as we are able to show on a variety of neural network language models. Finally, we show that weight tying can reduce the size of neural translation models to less than half of their original size without harming their performance.
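To make the tying recommendation concrete, here is a minimal sketch in plain Python of a forward pass in which one matrix acts as both the input embedding and the output embedding. It is an illustration under simplifying assumptions, not the authors' implementation: the recurrent layers of a real language model are omitted, so the hidden state is simply the input word's vector, and the names `next_word_probs` and `shared` as well as the toy sizes are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_probs(embedding, word_id):
    """Score every vocabulary word using one shared embedding matrix.

    The same matrix supplies the input vector (a row lookup) and the output
    projection (a dot product with every row). Assumption: the hidden layers
    are skipped, so the hidden state is just the input vector.
    """
    hidden = embedding[word_id]  # input embedding: look up the word's row
    logits = [
        sum(w * h for w, h in zip(row, hidden))  # output embedding: same rows
        for row in embedding
    ]
    return softmax(logits)


random.seed(0)
VOCAB_SIZE, EMBED_DIM = 5, 3  # toy sizes, chosen only for illustration
shared = [
    [random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

probs = next_word_probs(shared, word_id=2)
```

Because `shared` is used in both places, a training update to any row would change both how that word is read as input and how it is scored as output, which is the coupling the abstract's update-rule analysis is concerned with.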


Code & Models

Models

AstralPotato/en-ms-transformer


Taxonomy

Methods: Weight Tying