Optimizing Deep Transformers for Chinese-Thai Low-Resource Translation

Wenjie Hao; Hongfei Xu; Lingling Mu; Hongying Zan

arXiv:2212.12662·cs.CL·December 27, 2022

TL;DR

This paper improves Chinese-Thai low-resource translation by first tuning the experiment settings on a 6-layer Transformer and then deepening the model to 24 layers, obtaining state-of-the-art Chinese-to-Thai results in the CCMT 2022 constrained evaluation.

Contribution

The paper applies a 24-layer Transformer, configured with the best settings found for a 6-layer model, to low-resource Chinese-Thai translation and reports improved translation quality.

Findings

01. 24-layer Transformer outperforms shallower models

02. Achieved state-of-the-art Chinese-Thai translation results

03. Optimal experiment settings identified for low-resource scenarios

Abstract

In this paper, we study the use of a deep Transformer translation model for the CCMT 2022 Chinese-Thai low-resource machine translation task. We first explore the experiment settings (including the number of BPE merge operations, dropout probability, embedding size, etc.) for the low-resource scenario with a 6-layer Transformer. Considering that increasing the number of layers also increases the regularization on the new model parameters (dropout modules are also introduced when using more layers), we adopt the highest-performing setting but increase the depth of the Transformer to 24 layers to obtain improved translation quality. Our work obtains SOTA performance in Chinese-to-Thai translation in the constrained evaluation.
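
The two-stage recipe in the abstract (tune on a 6-layer model, then change only the depth) can be illustrated with a minimal Python sketch. It is not the authors' code. The class `TransformerConfig`, the helper names, and every numeric value except the 6 and 24 layer counts are hypothetical placeholders, since the abstract does not state the tuned settings. The sketch counts the weights in the encoder and decoder stacks and, under the assumption of one residual dropout per sublayer, the number of dropout sites, to show that both grow together as layers are added.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TransformerConfig:
    # Placeholder values for illustration; the abstract does not give the tuned settings.
    num_layers: int = 6      # used for both the encoder and the decoder stack
    embed_dim: int = 512
    ffn_dim: int = 2048
    dropout: float = 0.1
    bpe_merges: int = 8000


def attention_params(d: int) -> int:
    # Query, key, value, and output projections, each d x d plus a bias.
    return 4 * (d * d + d)


def ffn_params(d: int, f: int) -> int:
    # Two linear maps (d -> f and f -> d), each with a bias.
    return d * f + f + f * d + d


def layer_norm_params(d: int) -> int:
    # One gain vector and one bias vector.
    return 2 * d


def stack_params(cfg: TransformerConfig) -> int:
    """Weights in the encoder and decoder stacks, excluding embeddings."""
    d, f = cfg.embed_dim, cfg.ffn_dim
    encoder_layer = attention_params(d) + ffn_params(d, f) + 2 * layer_norm_params(d)
    decoder_layer = 2 * attention_params(d) + ffn_params(d, f) + 3 * layer_norm_params(d)
    return cfg.num_layers * (encoder_layer + decoder_layer)


def dropout_sites(cfg: TransformerConfig) -> int:
    # Assumption: one residual dropout per sublayer
    # (2 sublayers per encoder layer, 3 per decoder layer).
    return cfg.num_layers * (2 + 3)


def deepen(cfg: TransformerConfig, num_layers: int) -> TransformerConfig:
    # Keep every tuned setting and change only the depth.
    return replace(cfg, num_layers=num_layers)


best_shallow = TransformerConfig()
deep = deepen(best_shallow, 24)

for name, cfg in (("6-layer", best_shallow), ("24-layer", deep)):
    print(name, "stack weights:", stack_params(cfg), "dropout sites:", dropout_sites(cfg))
```

Under these assumptions, going from 6 to 24 layers multiplies both the stack weights and the dropout sites by four while the dropout probability, embedding size, and BPE setting stay fixed, which is the sense in which the added parameters arrive with their own regularization.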

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Layer Normalization · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer