Document-aligned Japanese-English Conversation Parallel Corpus

Matīss Rikters; Ryokan Ri; Tong Li; Toshiaki Nakazawa

arXiv:2012.06143 · cs.CL · December 14, 2020 · 5 cites

TL;DR

This paper introduces a high-quality Japanese-English conversation corpus aligned at the document level, addressing challenges in training and evaluating document-level machine translation by providing data and annotated evaluation sets.

Contribution

The paper presents a new document-aligned Japanese-English conversation corpus and an annotated evaluation set to improve document-level machine translation research.

Findings

01

Training MT models on the corpus shows that using context improves translation.

02

The annotated evaluation set marks the phenomena where sentence-level (SL) MT fails for lack of context.

03

The experiments demonstrate the benefit of context-aware MT models.

Abstract

Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resourced languages, but document-level (DL) MT has not: it is difficult to 1) train with only a small amount of DL data; and 2) evaluate, as the main methods and data sets focus on SL evaluation. To address the first issue, we present a document-aligned Japanese-English conversation corpus, including balanced, high-quality business conversation data for tuning and testing. As for the second issue, we manually identify the main areas where SL MT fails to produce adequate translations for lack of context. We then create an evaluation set where these phenomena are annotated to facilitate automatic evaluation of DL systems. We train MT models using our corpus to demonstrate how using context leads to improvements.
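One common way to give an MT model context from a document-aligned corpus is to prepend the preceding source utterances to each source sentence. The sketch below illustrates that general idea only; it is not confirmed to be the authors' method. The `<sep>` token, the `add_context` helper, the tuple-based document layout, and the toy sentences are all assumptions made for illustration.

```python
from typing import List, Tuple

SEP = " <sep> "  # hypothetical separator token, not taken from the paper


def add_context(doc: List[Tuple[str, str]], k: int = 1) -> List[Tuple[str, str]]:
    """Prefix each source sentence with up to k preceding source sentences.

    `doc` is one conversation as an ordered list of (source, target) pairs.
    Context never crosses document boundaries because each call sees one document.
    """
    examples = []
    for i, (src, tgt) in enumerate(doc):
        # Collect the k source sentences immediately before position i.
        context = [s for s, _ in doc[max(0, i - k):i]]
        examples.append((SEP.join(context + [src]), tgt))
    return examples


# Toy document invented for illustration; not drawn from the corpus.
doc = [
    ("田中さんはいますか。", "Is Mr. Tanaka there?"),
    ("いいえ、外出しています。", "No, he is out."),
]

for src, tgt in add_context(doc, k=1):
    print(src, "=>", tgt)
```

With `k=0` the function returns the plain sentence-level pairs, which makes it easy to compare a sentence-level baseline against a context-aware setup on the same data.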

Code & Models

Repositories

tsuruoka-lab/AMI-Meeting-Parallel-Corpus
Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems