DocHPLT: A Massively Multilingual Document-Level Translation Dataset
Dayyán O'Brien, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, Jörg Tiedemann

TL;DR
This paper introduces DocHPLT, the largest publicly available document-level translation dataset to date, built to support training and evaluation of long-context translation models, especially for under-resourced languages.
Contribution
We create and release DocHPLT, a massively multilingual document-level dataset that preserves complete document integrity, to support work on document-level translation and long-context modeling.
Findings
LLMs fine-tuned on DocHPLT outperform instruction-tuned baselines
Significant improvements for under-resourced languages
Optimal training context strategy identified for document translation
Abstract
Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available document-level translation dataset to date. It contains 124 million aligned document pairs across 50 languages paired with English, comprising 4.26 billion sentences. By adding pivoted alignments, practitioners can obtain 2500 additional pairs not involving English. Unlike previous reconstruction-based approaches that piece together documents from sentence-level data, we modify an existing web extraction pipeline to preserve complete document integrity from the source, retaining all content, including unaligned portions. After our preliminary experiments identify the optimal…
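The pivoting idea mentioned in the abstract can be illustrated with a minimal sketch: documents in two non-English languages that are both aligned to the same English document are paired with each other. The record layout used here (an English document ID, a language code, and the document text) and the function name `pivot_pairs` are assumptions made for illustration only; this summary does not describe the dataset's actual schema or the authors' tooling.

```python
from collections import defaultdict
from itertools import combinations


def pivot_pairs(english_aligned):
    """Pair non-English documents that share the same English document.

    `english_aligned` is an iterable of (english_doc_id, lang, text) tuples.
    This layout is hypothetical and not taken from the DocHPLT release.
    """
    # Group documents by the English document they are aligned to.
    # Simplification: if a language has several documents aligned to the
    # same English document, only the last one seen is kept.
    by_english = defaultdict(dict)
    for en_id, lang, text in english_aligned:
        by_english[en_id][lang] = text

    pivoted = []
    for docs in by_english.values():
        # Every unordered pair of languages sharing the pivot yields one pair.
        for lang_a, lang_b in combinations(sorted(docs), 2):
            pivoted.append((lang_a, docs[lang_a], lang_b, docs[lang_b]))
    return pivoted


records = [
    ("en-1", "fi", "Hei"),
    ("en-1", "sw", "Habari"),
    ("en-2", "fi", "Kiitos"),
    ("en-1", "mt", "Bongu"),
]
result = pivot_pairs(records)
```

In this toy input, three languages share the English document `en-1`, so each pair of them produces one pivoted document pair, while `en-2` has only one aligned language and contributes none.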
Taxonomy
Topics: Natural Language Processing Techniques
