Recurrent multiple shared layers in Depth for Neural Machine Translation
GuoLiang Li, Yiyang Li

TL;DR
This paper introduces a recurrent shared layer mechanism in Transformer models to deepen neural machine translation architectures efficiently, achieving comparable or better performance with fewer parameters and similar inference speed.
Contribution
The paper proposes a novel recurrent parameter sharing approach in Transformer models to enable deeper architectures without increasing parameters significantly.
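As a rough illustration of the idea, the sketch below loops one small block of layers several times in the depth direction, so effective depth grows while the parameter count stays fixed. It is a toy in plain Python, not the authors' implementation: the residual tanh layer standing in for a Transformer layer, the sizes chosen, and the names make_layer, apply_layer, run_recurrent_block, and count_params are all assumptions made for this example.

```python
import math
import random


def make_layer(dim, rng):
    """Hypothetical toy layer: a single dim x dim weight matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, x):
    """Residual connection around a linear map followed by tanh."""
    return [
        x[i] + math.tanh(sum(w * v for w, v in zip(row, x)))
        for i, row in enumerate(weights)
    ]


def run_recurrent_block(block, x, num_loops):
    """Apply the same list of layers num_loops times (parameters are shared)."""
    for _ in range(num_loops):
        for layer in block:
            x = apply_layer(layer, x)
    return x


def count_params(layers):
    """Count the weights stored in a list of layers."""
    return sum(len(row) for layer in layers for row in layer)


dim = 4
num_loops = 3
rng = random.Random(0)

# One block of two layers, reused at every recurrent step.
shared_block = [make_layer(dim, rng) for _ in range(2)]
x = [0.5, -0.2, 0.1, 0.3]

y_shared = run_recurrent_block(shared_block, x, num_loops)

# Unrolling the loop gives a 6-layer stack whose layers reuse the same weights.
unrolled = shared_block * num_loops
y_unrolled = run_recurrent_block(unrolled, x, 1)

effective_depth = len(shared_block) * num_loops
shared_params = count_params(shared_block)
# A stack of the same depth with independent layers would store this many weights.
independent_params = effective_depth * dim * dim

print("effective depth:", effective_depth)
print("shared parameters:", shared_params)
print("independent-stack parameters:", independent_params)
```

In this toy setting the looped block computes exactly what the unrolled 6-layer stack computes, while storing only the two layers' weights. How the paper's model trades depth against parameters in practice depends on details not covered here.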
Findings
Outperforms shallow Transformer baselines in BLEU scores.
Reduces model parameters by over 50% compared to deep Transformers.
Maintains inference speed similar to that of deep Transformer models.
Abstract
Learning deeper models is usually a simple and effective approach to improve model performance, but deeper models have more parameters and are more difficult to train. To get a deeper model, simply stacking more layers seems to work well, but previous works have claimed that it cannot benefit the model. We propose to train a deeper model with a recurrent mechanism, which loops the encoder and decoder blocks of the Transformer in the depth direction. To address the increase in model parameters, we choose to share parameters across different recursive moments. We conduct our experiments on the WMT16 English-to-German and WMT14 English-to-French translation tasks. Our model outperforms the shallow Transformer-Base/Big baseline by 0.35 and 1.45 BLEU points while using 27.23% of the Transformer-Big model's parameters. Compared to the deep Transformer (20-layer encoder, 6-layer decoder), our…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Softmax · Byte Pair Encoding · Multi-Head Attention · Dropout · Dense Connections
