Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs
Sachin Kumar, Yulia Tsvetkov

TL;DR
This paper introduces a novel continuous output method for sequence-to-sequence models using a Von Mises-Fisher loss, enabling faster training and handling larger vocabularies without sacrificing translation quality.
Contribution
It proposes a new probabilistic loss and training procedure replacing softmax with a continuous embedding layer for sequence models.
Findings
Achieves up to 2.5x training speed-up.
Performs on par with state-of-the-art models in translation quality.
Handles very large vocabularies effectively.
Abstract
The softmax function is used in the final layer of nearly all existing sequence-to-sequence models for language generation. However, it is usually the slowest layer to compute, which limits the vocabulary size to a subset of the most frequent types, and it has a large memory footprint. We propose a general technique for replacing the softmax layer with a continuous embedding layer. Our primary innovations are a novel probabilistic loss and a training and inference procedure in which we generate a probability distribution over pre-trained word embeddings, instead of a multinomial distribution over the vocabulary obtained via softmax. We evaluate this new class of sequence-to-sequence models with continuous outputs on the task of neural machine translation. We show that our models obtain up to 2.5x speed-up in training time while performing on par with the state-of-the-art models in terms of…
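The abstract does not spell out the loss, so the following is only a minimal sketch of how a von Mises-Fisher negative log-likelihood over word embeddings could be computed, using the standard vMF density f(x; μ, κ) = C_m(κ) exp(κ μᵀx) with C_m(κ) = κ^(m/2-1) / ((2π)^(m/2) I_(m/2-1)(κ)). It assumes that the norm of the model's output vector acts as the concentration κ, that targets are unit-normalized pre-trained embeddings, and that decoding picks the nearest embedding by cosine similarity. All function names are hypothetical and are not taken from the authors' code.

```python
import math
import numpy as np


def log_bessel_i(order, x, terms=500):
    """log of the modified Bessel function I_order(x), x > 0.

    Uses the power series I_v(x) = sum_k (x/2)^(2k+v) / (k! * Gamma(k+v+1)),
    summed in log space for stability. Adequate for small illustrative
    dimensions; large dimensions need a dedicated approximation.
    """
    log_terms = np.array([
        (2 * k + order) * math.log(x / 2.0)
        - math.lgamma(k + 1)
        - math.lgamma(k + order + 1)
        for k in range(terms)
    ])
    peak = log_terms.max()
    return peak + math.log(np.exp(log_terms - peak).sum())


def log_vmf_normalizer(dim, kappa):
    """log C_m(kappa) for the vMF density on the unit sphere in R^dim."""
    order = dim / 2.0 - 1.0
    return (order * math.log(kappa)
            - (dim / 2.0) * math.log(2.0 * math.pi)
            - log_bessel_i(order, kappa))


def vmf_nll(pred, target):
    """Negative log-likelihood of a target embedding under vMF.

    Assumption: kappa = ||pred|| and the mean direction is pred / ||pred||,
    so kappa * mean^T target reduces to pred^T target.
    """
    target = target / np.linalg.norm(target)
    kappa = float(np.linalg.norm(pred))
    return -log_vmf_normalizer(len(pred), kappa) - float(pred @ target)


def decode_nearest(pred, embedding_table):
    """Return the index of the embedding with highest cosine similarity."""
    unit_rows = embedding_table / np.linalg.norm(
        embedding_table, axis=1, keepdims=True)
    return int(np.argmax(unit_rows @ (pred / np.linalg.norm(pred))))


# Toy vocabulary of three 3-dimensional "pre-trained" embeddings.
table = np.array([[1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0],
                  [0.0, 0.0, 3.0]])
pred = np.array([0.1, 1.5, 0.2])          # hypothetical decoder output
best_index = decode_nearest(pred, table)   # index of the closest word vector
loss_if_correct = vmf_nll(pred, table[1])
loss_if_wrong = vmf_nll(pred, table[0])
```

In this sketch the output layer produces a single vector of embedding dimension rather than a score per vocabulary entry, which is the source of the speed and memory savings the abstract describes. The vocabulary-sized comparison happens only at decoding time.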
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Softmax
