Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation
Raj Dabre, Atsushi Fujita

TL;DR
This paper introduces recurrent stacking, in which one set of layer parameters is shared across every layer of the stack, and applies it to Transformer-based neural machine translation. The method reduces the parameter count while keeping translation quality close to that of a conventional multi-layer model, and it combines well with transfer learning.
Contribution
The paper proposes a recurrent stacking approach that shares parameters across layers, significantly reducing model size with only a small loss in translation quality. It also examines how transfer learning can recover that loss.
Findings
Recurrent stacking with shared parameters approaches the performance of multi-layer models.
Parameter sharing reduces model size substantially.
Transfer learning mitigates performance drops and speeds up decoding.
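As a rough illustration of the model-size finding, the arithmetic below counts the parameters of one Transformer encoder layer and compares a 6-layer stack with and without sharing. The dimensions (d_model = 512, d_ff = 2048) are assumed from the standard Transformer base configuration, not taken from this summary, and the function name is hypothetical. Embeddings, decoder layers, and the output projection are ignored, so this is a sketch of the per-stack saving rather than the paper's reported model sizes.

```python
def encoder_layer_params(d_model, d_ff):
    """Parameter count of one Transformer encoder layer (assumed standard layout)."""
    # Self-attention: query, key, value, and output projections, each with a bias.
    attention = 4 * (d_model * d_model + d_model)
    # Position-wise feed-forward: two linear maps with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer normalizations, each with a gain and a bias vector.
    layer_norms = 2 * (2 * d_model)
    return attention + feed_forward + layer_norms

# Assumed dimensions (Transformer base); depth 6 matches the abstract.
d_model, d_ff, depth = 512, 2048, 6

per_layer = encoder_layer_params(d_model, d_ff)
vanilla_total = depth * per_layer  # every layer has its own parameters
shared_total = per_layer           # one parameter set reused at every depth

print(per_layer, vanilla_total, shared_total)
```

Under these assumptions the shared stack holds one sixth of the encoder-layer parameters of the conventional stack, while the amount of computation per forward pass is unchanged because the layer is still applied 6 times.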
Abstract
In deep neural network modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in order to obtain high-quality continuous space representations which in turn improves the quality of the network's prediction. Conventionally, each layer in the stack has its own parameters which leads to a significant increase in the number of model parameters. In this paper, we propose to share parameters across all layers thereby leading to a recurrently stacked neural network model. We report on an extensive case study on neural machine translation (NMT), where we apply our proposed method to an encoder-decoder based neural network model, i.e., the Transformer model, and experiment with three Japanese–English translation datasets. We empirically demonstrate that the translation quality of a model that recurrently stacks a single layer 6 times, despite…
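The idea of sharing parameters across all layers can be sketched as follows. This is a minimal toy, not the authors' implementation: it uses a small residual tanh layer in plain Python instead of a Transformer layer, and all names (`make_layer`, `apply_layer`, `forward`) are hypothetical. A conventional stack holds six distinct layers; a recurrently stacked model holds one layer and applies it six times.

```python
import math
import random

def make_layer(dim, rng):
    """Create one toy layer: a dim x dim weight matrix and a bias vector."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    return {"weights": weights, "bias": [0.0] * dim}

def apply_layer(layer, x):
    """Residual tanh layer: y_i = x_i + tanh(sum_j W_ij * x_j + b_i)."""
    out = []
    for i, row in enumerate(layer["weights"]):
        pre = sum(w * v for w, v in zip(row, x)) + layer["bias"][i]
        out.append(x[i] + math.tanh(pre))
    return out

def forward(layers, x):
    """Apply the layers in order; the same layer object may appear repeatedly."""
    for layer in layers:
        x = apply_layer(layer, x)
    return x

rng = random.Random(0)
dim, depth = 4, 6

# Conventional stacking: each depth has its own parameters.
vanilla = [make_layer(dim, rng) for _ in range(depth)]

# Recurrent stacking: one layer object reused at every depth.
shared = [make_layer(dim, rng)] * depth

x = [1.0, 0.5, -0.5, 0.0]
y_vanilla = forward(vanilla, x)
y_shared = forward(shared, x)

# Number of distinct parameter sets held by each model.
distinct_vanilla = len({id(layer) for layer in vanilla})
distinct_shared = len({id(layer) for layer in shared})
print(distinct_vanilla, distinct_shared)
```

Both models run the same number of layer applications per input; only the number of stored parameter sets differs (six versus one in this toy).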
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Dropout · Multi-Head Attention · Layer Normalization · Label Smoothing
