Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation

Raj Dabre; Atsushi Fujita

arXiv:2106.10002·cs.CL·June 21, 2021

TL;DR

This paper shares one layer's parameters across every layer of the stack (recurrent stacking) and applies the idea to Transformer-based neural machine translation. The resulting model is much smaller, its translation quality stays close to that of a conventional multi-layer model, and transfer learning is explored as a way to offset the remaining gap.

Contribution

It proposes a recurrent stacking approach that shares parameters across layers, significantly reducing model size with minimal loss in translation quality, and explores transfer learning benefits.

Findings

01

Recurrent stacking with shared parameters approaches the performance of multi-layer models.

02

Parameter sharing reduces model size substantially.

03

Transfer learning mitigates performance drops and speeds up decoding.
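
To make the size reduction in finding 02 concrete, the following is a rough back-of-the-envelope sketch in Python. It counts the parameters of one Transformer encoder layer and compares a stack of 6 independent layers with one layer reused 6 times. The dimensions (d_model = 512, d_ff = 2048) are the common Transformer-base values and are an assumption here, not figures reported on this page; embeddings, the decoder, and the output projection are ignored, so the totals are illustrative only.

```python
# Illustrative parameter arithmetic. All dimensions below are assumptions
# (Transformer-base defaults), not numbers taken from the paper.

def attention_params(d_model):
    # Query, key, value and output projections: each d_model x d_model plus a bias.
    return 4 * (d_model * d_model + d_model)

def feed_forward_params(d_model, d_ff):
    # Two linear maps (d_model -> d_ff -> d_model), each with a bias.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

def layer_norm_params(d_model):
    # One gain vector and one bias vector.
    return 2 * d_model

def encoder_layer_params(d_model, d_ff):
    # Self-attention + feed-forward + two layer norms.
    return (attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 2 * layer_norm_params(d_model))

D_MODEL, D_FF, DEPTH = 512, 2048, 6  # assumed sizes; depth 6 matches the abstract

per_layer = encoder_layer_params(D_MODEL, D_FF)
vanilla_total = DEPTH * per_layer   # every layer owns its parameters
shared_total = per_layer            # one layer reused DEPTH times

print("per layer:        ", per_layer)
print("6 distinct layers:", vanilla_total)
print("1 shared layer:   ", shared_total)
```

Under these assumptions the layer stack shrinks by a factor equal to the depth. The reduction for a full model is smaller, because embeddings and other unshared components are unaffected.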

Abstract

In deep neural network modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in order to obtain high-quality continuous space representations, which in turn improve the quality of the network's prediction. Conventionally, each layer in the stack has its own parameters, which leads to a significant increase in the number of model parameters. In this paper, we propose to share parameters across all layers, thereby leading to a recurrently stacked neural network model. We report on an extensive case study on neural machine translation (NMT), where we apply our proposed method to an encoder-decoder based neural network model, i.e., the Transformer model, and experiment with three Japanese–English translation datasets. We empirically demonstrate that the translation quality of a model that recurrently stacks a single layer 6 times, despite…
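
The difference between conventional stacking and the recurrent stacking described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy residual ReLU layer stands in for a full Transformer layer, and the names make_layer, apply_layer, vanilla_stack, recurrent_stack, and count_params are hypothetical.

```python
import random

def make_layer(dim, rng):
    """Create one layer's parameters: a dim x dim weight matrix and a bias vector."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    bias = [0.0] * dim
    return weight, bias

def apply_layer(layer, x):
    """Toy stand-in for a Transformer layer: x + relu(W x + b)."""
    weight, bias = layer
    hidden = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weight, bias)]
    return [v + max(0.0, h) for v, h in zip(x, hidden)]

def vanilla_stack(layers, x):
    """Conventional stacking: each step uses that layer's own parameters."""
    for layer in layers:
        x = apply_layer(layer, x)
    return x

def recurrent_stack(layer, x, depth):
    """Recurrent stacking: the same parameters are applied depth times."""
    for _ in range(depth):
        x = apply_layer(layer, x)
    return x

def count_params(layers):
    """Count parameters, counting a layer object reused in the list only once."""
    unique = {id(layer): layer for layer in layers}
    return sum(len(w) * len(w[0]) + len(b) for w, b in unique.values())

rng = random.Random(0)
dim, depth = 4, 6
x = [1.0, 2.0, 3.0, 4.0]

shared = make_layer(dim, rng)
distinct = [make_layer(dim, rng) for _ in range(depth)]

y_shared = recurrent_stack(shared, x, depth)
y_vanilla = vanilla_stack(distinct, x)

print("shared-layer parameters:  ", count_params([shared] * depth))
print("distinct-layer parameters:", count_params(distinct))
```

Both variants perform the same number of layer applications, so the computation per input is comparable; only the number of stored parameters differs.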

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Dropout · Multi-Head Attention · Layer Normalization · Label Smoothing