Recovering document annotations for sentence-level bitext
Rachel Wicks, Matt Post, Philipp Koehn

TL;DR
This paper reconstructs document-level metadata for large datasets to enable context-aware machine translation, demonstrating improved translation quality with longer contexts while maintaining sentence-level performance.
Contribution
It introduces a method to recover document-level information and a filtering technique that favors context-consistent translations, enhancing document-level translation models.
Findings
Improved document-level translation performance.
No degradation in sentence-level translation quality.
Releases of datasets and models for community use.
Abstract
Data availability limits the scope of any given task. In machine translation, historical models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and without data to adequately approach context-aware machine translation. Most large-scale datasets have been processed through a pipeline that discards document-level metadata. In this work, we reconstruct document-level information for three large datasets (ParaCrawl, News Commentary, and Europarl) in German, French, Spanish, Italian, Polish, and Portuguese (paired with English). We then introduce a document-level filtering technique as an alternative to traditional bitext filtering. We present this filtering with analysis to show that this method prefers context-consistent translations rather…
Taxonomy
Topics: Advanced Database Systems and Queries · Semantic Web and Ontologies · Natural Language Processing Techniques
