Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data

Minato Kondo; Takehito Utsuro; Masaaki Nagata

arXiv:2407.03145·cs.CL·July 4, 2024

TL;DR

This paper introduces a two-phase training method for large language models that improves translation accuracy by continual pre-training on parallel data, emphasizing the importance of data format and sentence order.

Contribution

The study demonstrates that alternating source and target sentences during continual pre-training enhances translation accuracy and robustness, especially for spoken language, with minimal additional data.

Findings

01. Alternating source and target sentences improves translation accuracy.

02. Interleaved data with tags yields the highest accuracy.

03. LLMs outperform supervised models with less data.

Abstract

In this paper, we propose a two-phase training approach in which pre-trained large language models are continually pre-trained on parallel data and then fine-tuned in a supervised manner on a small amount of high-quality parallel data. To investigate the effectiveness of the proposed approach, we conducted continual pre-training with a 3.8B-parameter model and parallel data in eight different formats. We evaluated these methods on thirteen test sets for Japanese-to-English and English-to-Japanese translation. The results demonstrate that, when parallel data is used in continual pre-training, it is essential to alternate between source and target sentences. We also show that translation accuracy improves only for translation directions in which the order of source and target sentences in the continual pre-training data matches the order used at inference. In addition, we demonstrate that the…
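As a rough illustration of what "alternating between source and target sentences" could look like in practice, the sketch below builds interleaved training text from sentence pairs, with and without language tags. It is only a minimal sketch under stated assumptions: the function name `interleave_pairs`, the tag strings `<ja>` and `<en>`, the newline separator, and the example sentences are all hypothetical, and this page does not specify the eight formats actually used in the paper.

```python
from typing import Iterable, List, Tuple


def interleave_pairs(
    pairs: Iterable[Tuple[str, str]],
    src_tag: str = "",
    tgt_tag: str = "",
    sep: str = "\n",
) -> str:
    """Join (source, target) pairs so each source is directly followed by its target.

    The tags and separator are illustrative placeholders, not the paper's format.
    """
    lines: List[str] = []
    for src, tgt in pairs:
        lines.append(f"{src_tag}{src}")  # source sentence first
        lines.append(f"{tgt_tag}{tgt}")  # its translation immediately after
    return sep.join(lines)


# Hypothetical Japanese-English sentence pairs, used only for illustration.
pairs = [
    ("猫が好きです。", "I like cats."),
    ("今日は雨です。", "It is raining today."),
]

# Japanese-to-English order, without tags.
ja_en_plain = interleave_pairs(pairs)

# Japanese-to-English order, with hypothetical language tags.
ja_en_tagged = interleave_pairs(pairs, src_tag="<ja> ", tgt_tag="<en> ")

# English-to-Japanese order: swap each pair before interleaving.
en_ja_tagged = interleave_pairs(
    [(en, ja) for ja, en in pairs], src_tag="<en> ", tgt_tag="<ja> "
)

print(ja_en_tagged)
```

Because the abstract reports that accuracy improves only when the source-target order in the continual pre-training data matches the order at inference, a sketch like this would build a separate text stream per translation direction rather than reuse one stream for both.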

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling