Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation
Ireh Kim, Tesia Sker, Chanwoo Kim

TL;DR
This paper introduces a two-stage fine-tuning approach for document-level machine translation using filtered synthetic data generated and refined by large language models, improving coherence and reducing hallucinations.
Contribution
It proposes a novel data augmentation and filtering pipeline combined with a two-stage fine-tuning strategy to enhance LLM-based document-level translation performance.
Findings
Improved translation coherence across documents.
Reduced hallucinations and omissions in generated translations.
Effective use of synthetic data with filtering metrics.
Abstract
In machine translation (MT), Large Language Models (LLMs) have generally underperformed conventional encoder-decoder systems and have therefore seen limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it using multiple metrics, leveraging sacreBLEU, COMET, and LaBSE-based cosine…
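The multi-metric filtering step described in the abstract can be illustrated with a minimal sketch. It assumes that a synthetic pair is kept only when every metric clears its threshold; the abstract does not state how the metrics are combined, so the conjunction rule, the names `Pair`, `cosine`, and `filter_pairs`, and all threshold values are hypothetical. The scoring functions are passed in by the caller, so real sacreBLEU, COMET, or LaBSE scorers could be plugged in without changing the filter.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Dict, Iterable, List, Sequence


@dataclass
class Pair:
    source: str  # source-language document
    target: str  # LLM-generated translation of that document


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pairs(
    pairs: Iterable[Pair],
    scorers: Dict[str, Callable[[Pair], float]],
    thresholds: Dict[str, float],
) -> List[Pair]:
    """Keep a pair only if every thresholded metric meets its minimum.

    Assumption: the paper may combine its metrics differently; this
    all-must-pass rule is only one plausible choice.
    """
    kept = []
    for pair in pairs:
        if all(scorers[name](pair) >= thresholds[name] for name in thresholds):
            kept.append(pair)
    return kept


if __name__ == "__main__":
    # Toy embeddings standing in for sentence-encoder output (hypothetical values).
    toy_embeddings = {
        "doc A": [1.0, 0.0],
        "faithful translation": [0.9, 0.1],
        "hallucinated text": [0.0, 1.0],
    }
    scorers = {
        "embedding_cosine": lambda p: cosine(
            toy_embeddings[p.source], toy_embeddings[p.target]
        )
    }
    candidates = [
        Pair("doc A", "faithful translation"),
        Pair("doc A", "hallucinated text"),
    ]
    # The 0.8 threshold is illustrative, not a value from the paper.
    for pair in filter_pairs(candidates, scorers, {"embedding_cosine": 0.8}):
        print(pair.target)
```

Keeping the scorers as plain callables separates the filtering policy from the metric implementations, which makes it easy to tune thresholds per metric or to swap a metric out.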
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
