Balancing Cost and Benefit with Tied-Multi Transformers

Raj Dabre; Raphael Rubino; Atsushi Fujita

arXiv:2002.08614 · cs.CL · February 21, 2020 · 1 citation

Open Access

TL;DR

This paper introduces a method for training tied-multi Transformers that allows dynamic adjustment of encoder and decoder layers during decoding, reducing costs while maintaining translation quality.

Contribution

It proposes a novel training procedure for tied-multi Transformers, enabling flexible layer usage and efficient decoding in sequence-to-sequence models.

Findings

1. Reduces decoding costs in neural machine translation
2. Maintains translation quality with fewer layers
3. Enables dynamic layer selection during inference

Abstract

We propose and evaluate a novel procedure for training multiple Transformers with tied parameters, which compresses multiple models into one and enables the dynamic choice of the number of encoder and decoder layers during decoding. In sequence-to-sequence modeling, the output of the last layer of the N-layer encoder is typically fed to the M-layer decoder, and the output of the last decoder layer is used to compute the loss. Instead, our method computes a single loss consisting of N×M losses, where each loss is computed from the output of one of the M decoder layers connected to one of the N encoder layers. Such a model subsumes N×M models with different numbers of encoder and decoder layers, and can be used for decoding with fewer than the maximum number of encoder and decoder layers. We then propose a mechanism to choose a priori the number of encoder and decoder layers for faster decoding,…
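
The N×M loss described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the layers are plain weight matrices rather than Transformer layers, the "decoder" simply starts from the encoder output instead of attending to it, and the N×M losses are combined by averaging, which is an assumption since the abstract does not say how they are aggregated. All function names (make_layer, apply_layer, cross_entropy, tied_multi_loss) are hypothetical.

```python
import math
import random


def make_layer(dim, rng):
    """Toy stand-in for one Transformer layer: a dim x dim weight matrix."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, x):
    """Residual update x + tanh(W x), computed with plain lists."""
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for row, xi in zip(weights, x)
    ]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_z - logits[target]


def tied_multi_loss(enc_layers, dec_layers, x, target):
    """Compute one loss per (encoder depth n, decoder depth m) pair.

    The same layer objects are reused for every pair, so all N*M
    sub-models share (tie) their parameters. Returns the mean loss
    (assumed aggregation) and the dict of individual losses.
    """
    losses = {}
    enc_out = x
    for n, enc in enumerate(enc_layers, start=1):
        enc_out = apply_layer(enc, enc_out)  # output of encoder layer n
        dec_out = enc_out  # toy simplification: decoder starts from encoder output
        for m, dec in enumerate(dec_layers, start=1):
            dec_out = apply_layer(dec, dec_out)  # output of decoder layer m
            losses[(n, m)] = cross_entropy(dec_out, target)
    return sum(losses.values()) / len(losses), losses


if __name__ == "__main__":
    rng = random.Random(0)
    encoder = [make_layer(4, rng) for _ in range(3)]  # N = 3
    decoder = [make_layer(4, rng) for _ in range(2)]  # M = 2
    total, per_pair = tied_multi_loss(encoder, decoder, [0.1, -0.2, 0.3, 0.0], 2)
    print(f"combined loss over {len(per_pair)} pairs: {total:.4f}")
    for (n, m), value in sorted(per_pair.items()):
        print(f"  encoder depth {n}, decoder depth {m}: {value:.4f}")
```

In this sketch the entry for the deepest pair (N, M) corresponds to the conventional loss of the full model, while the remaining entries correspond to the shallower sub-models that the tied-multi model subsumes.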

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Knowledge Distillation