DocHPLT: A Massively Multilingual Document-Level Translation Dataset

Dayyán O'Brien; Bhavitvya Malik; Ona de Gibert; Pinzhen Chen; Barry Haddow; Jörg Tiedemann

arXiv:2508.13079·cs.CL·October 1, 2025

TL;DR

This paper introduces DocHPLT, the largest publicly available document-level translation dataset to date. It supports training and evaluation of long-context translation models, especially for under-resourced languages.

Contribution

We created and open-sourced DocHPLT, a massively multilingual dataset that preserves complete documents from the source, to support work on document-level translation and long-context modeling.

Findings

01 LLMs fine-tuned on DocHPLT outperform instruction-tuned baselines

02 Significant improvements for under-resourced languages

03 Optimal training context strategy identified for document translation

Abstract

Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available document-level translation dataset to date. It contains 124 million aligned document pairs across 50 languages paired with English, comprising 4.26 billion sentences. By adding pivoted alignments, practitioners can obtain 2500 additional pairs not involving English. Unlike previous reconstruction-based approaches that piece together documents from sentence-level data, we modify an existing web extraction pipeline to preserve complete document integrity from the source, retaining all content, including unaligned portions. After our preliminary experiments identify the optimal…
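
The pivoted alignments mentioned in the abstract can be illustrated with a small join on a shared English document. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `pivot_pairs`, the dictionary layout, and the document IDs are all hypothetical and chosen only for illustration.

```python
from itertools import product


def pivot_pairs(en_to_x, en_to_y):
    """Join two English-centric alignments on the shared English document ID.

    Both arguments are hypothetical mappings of the form
    {english_doc_id: [aligned_doc_id, ...]}; this layout is an assumption,
    not the format of the released dataset.
    """
    pairs = []
    for en_id, x_docs in en_to_x.items():
        # Only English documents present in both alignments yield a pivot pair.
        y_docs = en_to_y.get(en_id, [])
        for x_doc, y_doc in product(x_docs, y_docs):
            pairs.append((x_doc, y_doc))
    return pairs


# Toy data: only "en-001" is aligned to both a French and a German document.
en_fr = {"en-001": ["fr-101"], "en-002": ["fr-102"]}
en_de = {"en-001": ["de-201"], "en-003": ["de-203"]}

print(pivot_pairs(en_fr, en_de))
```

In this toy example only the documents aligned to `en-001` form a French-German pair; English documents aligned to just one of the two languages contribute nothing.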

Code & Models

Datasets

HPLT/DocHPLT
dataset · 1.8k dl

Videos

DocHPLT: A Massively Multilingual Document-Level Translation Dataset

Taxonomy

Topics: Natural Language Processing Techniques