A High-Quality Multilingual Dataset for Structured Documentation Translation
Kazuma Hashimoto, Raffaella Buschiazzo, James Bradbury, Teresa Marshall, Richard Socher, Caiming Xiong

TL;DR
This paper introduces a high-quality multilingual dataset of XML-structured documentation for translation, enabling improved translation models that preserve structure and support multiple language pairs, including non-English ones.
Contribution
The paper provides a novel XML-structured parallel dataset for documentation translation and evaluates models that incorporate XML constraints and copy mechanisms for better accuracy.
Findings
XML-aware translation improves accuracy
Beam search effectively maintains XML structure
Copy mechanisms involve trade-offs, notably when translating numerical words and named entities
Abstract
This paper presents a high-quality multilingual dataset for the documentation domain to advance research on localization of structured text. Unlike widely used datasets for translation of plain text, we collect XML-structured parallel text segments from the online documentation for an enterprise software platform. These Web pages have been professionally translated from English into 16 languages and maintained by domain experts, and around 100,000 text segments are available for each language pair. We build and evaluate translation models for seven target languages from English, with several different copy mechanisms and an XML-constrained beam search. We also experiment with a non-English pair to show that our dataset has the potential to explicitly enable 17 × 16 translation settings. Our experiments show that learning to translate with the XML tags improves translation…
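To make the idea of an XML-constrained beam search more concrete, the following is a minimal sketch of one way a decoder could filter candidate tokens so that every hypothesis stays well-nested. It is an illustration under assumptions, not the authors' implementation: the helper names `update_stack` and `allowed`, the `[EOS]` marker, and the example tag names are all hypothetical, and attributes and self-closing tags are ignored.

```python
import re

# Hypothetical token format: "<name>" opens a tag, "</name>" closes it.
TAG = re.compile(r"^<(/?)([A-Za-z_][\w.-]*)>$")


def update_stack(stack, token):
    """Return the open-tag stack after emitting `token`.

    `stack` is a tuple of currently open tag names. Returns None when
    `token` is a closing tag that does not match the innermost open tag.
    """
    match = TAG.match(token)
    if match is None:
        return stack  # plain text token: nesting is unchanged
    is_closing, name = match.group(1) == "/", match.group(2)
    if not is_closing:
        return stack + (name,)
    if stack and stack[-1] == name:
        return stack[:-1]
    return None


def allowed(stack, candidates, eos="[EOS]"):
    """Keep only the candidates that preserve well-formed nesting.

    The end-of-sequence marker is allowed only once every tag is closed.
    """
    kept = []
    for token in candidates:
        if token == eos:
            if not stack:
                kept.append(token)
        elif update_stack(stack, token) is not None:
            kept.append(token)
    return kept


# One hypothesis that has emitted "<ph>" so far.
open_tags = update_stack((), "<ph>")
print(allowed(open_tags, ["</ph>", "</b>", "<b>", "text", "[EOS]"]))
```

In a beam search, a filter of this kind would be applied to each hypothesis before scoring its expansions, so mismatched closing tags and premature termination never enter the beam.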
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
