JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus

Makoto Morishita; Jun Suzuki; Masaaki Nagata

arXiv:1911.10668 · cs.CL · March 17, 2020 · 21 cites

TL;DR

This paper introduces JParaCrawl, a large-scale English-Japanese parallel corpus created by web crawling and automatic sentence alignment, significantly enhancing resources for machine translation and enabling effective pre-training and domain adaptation.

Contribution

The paper presents a new extensive parallel corpus for English-Japanese, constructed from web data, and demonstrates its effectiveness for pre-training and improving machine translation models.

Findings

01

JParaCrawl contains over 8.7 million sentence pairs.

02

Models pre-trained on JParaCrawl and fine-tuned on in-domain data match or outperform models trained from scratch, while requiring less training time.

03

Combining JParaCrawl with in-domain data yields the best translation performance.

Abstract

Recent machine translation algorithms mainly rely on parallel corpora. However, since the availability of parallel corpora remains limited, only some resource-rich language pairs can benefit from them. We constructed a parallel corpus for English-Japanese, for which the amount of publicly available parallel corpora is still limited. We constructed the parallel corpus by broadly crawling the web and automatically aligning parallel sentences. Our collected corpus, called JParaCrawl, amassed over 8.7 million sentence pairs. We show how it includes a broader range of domains and how a neural machine translation model trained with it works as a good pre-trained model for fine-tuning specific domains. The pre-training and fine-tuning approaches achieved or surpassed performance comparable to model training from the initial state and reduced the training time. Additionally, we trained the…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression