GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation
Jian Yang, Yuwei Yin, Liqun Yang, Shuming Ma, Haoyang Huang, Dongdong Zhang, Furu Wei, Zhoujun Li

TL;DR
GTrans introduces a flexible grouping and fusion mechanism for Transformer layers in neural machine translation, leveraging multi-layer features to improve translation quality across various benchmarks.
Contribution
The paper proposes GTrans, a novel model that groups and fuses features from all Transformer layers, effectively utilizing bottom-layer information often ignored in standard models.
Findings
GTrans outperforms standard Transformer models on multiple translation benchmarks.
The model scales effectively to 60 encoder and 36 decoder layers.
Experimental results show consistent performance gains.
Abstract
Transformer structure, stacked by a sequence of encoder and decoder network layers, achieves significant development in neural machine translation. However, vanilla Transformer mainly exploits the top-layer representation, assuming the lower layers provide trivial or redundant information and thus ignoring the bottom-layer feature that is potentially valuable. In this work, we propose the Group-Transformer model (GTrans) that flexibly divides multi-layer representations of both encoder and decoder into different groups and then fuses these group features to generate target words. To corroborate the effectiveness of the proposed method, extensive experiments and analytic experiments are conducted on three bilingual translation benchmarks and two multilingual translation tasks, including the IWSLT-14, IWSLT-17, LDC, WMT-14 and OPUS-100 benchmarks. Experimental and analytical results…
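The grouping-and-fusing idea in the abstract can be illustrated with a toy sketch. This summary does not give the paper's actual fusion functions, so everything here is an assumption made for illustration: the helper names (`split_into_groups`, `group_and_fuse`), contiguous near-equal groups, a plain mean inside each group, and softmax-normalized weights across groups. It is not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(mats, weights):
    # Element-wise weighted sum of equally shaped (seq_len x d_model) matrices.
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(w * m[i][j] for m, w in zip(mats, weights)) for j in range(cols)]
            for i in range(rows)]

def split_into_groups(layers, num_groups):
    # Assumption: contiguous, near-equal groups; earlier groups absorb the remainder.
    if not 1 <= num_groups <= len(layers):
        raise ValueError("num_groups must be between 1 and the number of layers")
    base, extra = divmod(len(layers), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)
        groups.append(layers[start:start + size])
        start += size
    return groups

def group_and_fuse(layers, num_groups, group_logits=None):
    # Step 1: divide the per-layer representations into groups.
    groups = split_into_groups(layers, num_groups)
    # Step 2: fuse inside each group (assumption: simple mean).
    group_feats = [weighted_sum(g, [1.0 / len(g)] * len(g)) for g in groups]
    # Step 3: fuse across groups (assumption: softmax weights, uniform by default).
    if group_logits is None:
        group_logits = [0.0] * num_groups
    return weighted_sum(group_feats, softmax(group_logits))

# Toy input: 6 layers, each 2 tokens x 3 features, with layer k filled with the value k.
layers = [[[float(k)] * 3 for _ in range(2)] for k in range(1, 7)]
fused = group_and_fuse(layers, num_groups=3)
print(len(fused), len(fused[0]))  # sequence length and feature size are preserved
```

In a trained model the group weights would presumably be learned parameters rather than fixed logits; passing `group_logits` here only shows how shifting weight toward one group changes the fused output.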
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Dense Connections · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding
