Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation

Ireh Kim; Tesia Sker; Chanwoo Kim

arXiv:2603.22186 · cs.CL · March 24, 2026

TL;DR

This paper introduces a two-stage fine-tuning approach for document-level machine translation using filtered synthetic data generated and refined by large language models, improving coherence and reducing hallucinations.

Contribution

It proposes a novel data augmentation and filtering pipeline combined with a two-stage fine-tuning strategy to enhance LLM-based document-level translation performance.

Findings

01. Improved translation coherence across documents.

02. Reduced hallucinations and omissions in generated translations.

03. Effective use of synthetic data with filtering metrics.

Abstract

In machine translation (MT), large language models (LLMs) have generally underperformed conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it using multiple metrics, leveraging sacreBLEU, COMET, and LaBSE-based cosine…
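
The multi-metric filtering step mentioned in the abstract can be pictured with a minimal sketch. The abstract is truncated and does not say how the metrics are combined, so everything below is an assumption for illustration only: the `ScoredPair` type, the rule that a pair must clear every threshold, the metric key names, and the threshold values are all hypothetical and are not taken from the paper. Scores are assumed to be computed beforehand by the respective tools.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredPair:
    """A synthetic document pair with precomputed metric scores (hypothetical type)."""
    source: str
    target: str
    scores: Dict[str, float]


# Hypothetical thresholds; the paper's actual values are not given in the abstract.
THRESHOLDS: Dict[str, float] = {
    "sacrebleu": 20.0,
    "comet": 0.75,
    "labse_cosine": 0.80,
}


def passes_filters(pair: ScoredPair, thresholds: Dict[str, float]) -> bool:
    """Return True only if every metric meets its minimum (assumed AND rule).

    A missing score counts as a failure rather than a pass.
    """
    return all(
        pair.scores.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    )


def filter_pairs(pairs: List[ScoredPair], thresholds: Dict[str, float]) -> List[ScoredPair]:
    """Keep the pairs that clear all thresholds, preserving input order."""
    return [pair for pair in pairs if passes_filters(pair, thresholds)]


if __name__ == "__main__":
    candidates = [
        ScoredPair("doc A", "trans A", {"sacrebleu": 35.0, "comet": 0.82, "labse_cosine": 0.91}),
        ScoredPair("doc B", "trans B", {"sacrebleu": 12.0, "comet": 0.88, "labse_cosine": 0.93}),
    ]
    kept = filter_pairs(candidates, THRESHOLDS)
    print([pair.source for pair in kept])
```

Requiring every metric to pass is the strictest way to combine them; a weighted score or a majority vote would be equally plausible readings of the abstract.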

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification