Recurrent multiple shared layers in Depth for Neural Machine Translation

GuoLiang Li; Yiyang Li

arXiv:2108.10417·cs.CL·August 27, 2021

TL;DR

This paper introduces a recurrent shared layer mechanism in Transformer models to deepen neural machine translation architectures efficiently, achieving comparable or better performance with fewer parameters and similar inference speed.

Contribution

The paper proposes a novel recurrent parameter sharing approach in Transformer models to enable deeper architectures without increasing parameters significantly.

Findings

01

Outperforms shallow Transformer baselines in BLEU scores.

02

Uses 54.72% of the parameters of the deep Transformer (20-layer encoder, 6-layer decoder).

03

Maintains inference speed similar to that of deep Transformer models.

Abstract

Learning deeper models is usually a simple and effective way to improve performance, but deeper models have more parameters and are harder to train. To get a deeper model, simply stacking more layers seems like it should work, but previous work has claimed that it does not benefit the model. We propose to train a deeper model with a recurrent mechanism that loops the encoder and decoder blocks of the Transformer in the depth direction. To limit the growth in parameters, we share parameters across the recursive moments. We conduct experiments on the WMT16 English-to-German and WMT14 English-to-French translation tasks. Our model outperforms the shallow Transformer-Base/Big baselines by 0.35 and 1.45 BLEU points while using 27.23% of the Transformer-Big model's parameters. Compared to the deep Transformer (20-layer encoder, 6-layer decoder), our…
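The looping-with-shared-parameters idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy "layer" below is a single residual weight matrix standing in for a full Transformer layer, and the names `make_layer`, `apply_layer`, `recurrent_shared_stack`, and `count_params` are hypothetical. It only shows that looping a block several times in depth raises the effective depth while the parameter count stays that of one block.

```python
import random

def make_layer(dim, rng):
    """Toy stand-in for a Transformer layer: one dim x dim weight matrix (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

def apply_layer(weights, x):
    """Residual update: x + relu(W x), computed element by element."""
    out = []
    for row, xi in zip(weights, x):
        update = sum(w * v for w, v in zip(row, x))
        out.append(xi + max(0.0, update))
    return out

def recurrent_shared_stack(block, x, loops):
    """Apply the same block of layers `loops` times in depth, reusing its weights."""
    for _ in range(loops):
        for weights in block:
            x = apply_layer(weights, x)
    return x

def count_params(block):
    """Number of scalar weights stored in the block."""
    return sum(len(row) for weights in block for row in weights)

rng = random.Random(0)
dim, layers_per_block, loops = 4, 2, 3  # illustrative sizes, not from the paper
block = [make_layer(dim, rng) for _ in range(layers_per_block)]

x = [1.0, -0.5, 0.25, 2.0]
y = recurrent_shared_stack(block, x, loops)

effective_depth = layers_per_block * loops  # layers the input passes through
shared_params = count_params(block)         # stored once, reused every loop
unshared_params = shared_params * loops     # cost of stacking distinct layers instead

print(f"effective depth: {effective_depth}")
print(f"parameters with sharing: {shared_params}, without: {unshared_params}")
```

In this toy setting, sharing keeps the stored weights at one block's worth regardless of `loops`, which is the trade-off the abstract describes: more depth at inference time without a matching growth in parameters.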

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Softmax · Byte Pair Encoding · Multi-Head Attention · Dropout · Dense Connections