JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus
Makoto Morishita, Jun Suzuki, Masaaki Nagata

TL;DR
This paper introduces JParaCrawl, a large-scale English-Japanese parallel corpus built by web crawling and automatic sentence alignment. It expands the publicly available resources for this language pair, and a model trained on it serves as a good pre-trained model for fine-tuning on specific domains.
Contribution
The paper presents a new extensive parallel corpus for English-Japanese, constructed from web data, and demonstrates its effectiveness for pre-training and improving machine translation models.
Findings
JParaCrawl contains over 8.7 million sentence pairs.
Models pre-trained on JParaCrawl and then fine-tuned match or outperform models trained from scratch, with reduced training time.
Combining JParaCrawl with in-domain data yields the best translation performance.
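As a rough illustration of what combining a general web corpus with in-domain data can look like at the data level, here is a minimal sketch. It is not the paper's procedure: the function name `mix_corpora`, the `upsample` factor, and the toy sentence pairs are all hypothetical, and the paper's own approach is described only as pre-training followed by fine-tuning.

```python
import random
from typing import List, Tuple

# A sentence pair is (English sentence, Japanese sentence).
Pair = Tuple[str, str]


def mix_corpora(
    general: List[Pair],
    in_domain: List[Pair],
    upsample: int = 2,
    seed: int = 0,
) -> List[Pair]:
    """Concatenate a general corpus with an in-domain corpus.

    The in-domain pairs are repeated `upsample` times (a hypothetical
    knob, not a value from the paper) so that a small in-domain set is
    not drowned out by the much larger general corpus.
    """
    if upsample < 1:
        raise ValueError("upsample must be >= 1")
    mixed = list(general) + list(in_domain) * upsample
    # Shuffle with a fixed seed so the mix is reproducible.
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy data standing in for a web-crawled corpus and an in-domain corpus.
general = [("Hello.", "こんにちは。"), ("Thank you.", "ありがとう。")]
in_domain = [("This is a pen.", "これはペンです。")]

mixed = mix_corpora(general, in_domain, upsample=3)
print(len(mixed))  # 2 general pairs + 3 copies of the in-domain pair = 5
```

In practice the upsampling ratio would be tuned on a development set, and a two-stage schedule (pre-train on the general corpus, then fine-tune on in-domain data) is an alternative to mixing.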
Abstract
Recent machine translation algorithms mainly rely on parallel corpora. However, since the availability of parallel corpora remains limited, only some resource-rich language pairs can benefit from them. We constructed a parallel corpus for English-Japanese, for which the amount of publicly available parallel corpora is still limited. We constructed the parallel corpus by broadly crawling the web and automatically aligning parallel sentences. Our collected corpus, called JParaCrawl, amassed over 8.7 million sentence pairs. We show how it includes a broader range of domains and how a neural machine translation model trained with it works as a good pre-trained model for fine-tuning specific domains. The pre-training and fine-tuning approach achieved performance comparable to or better than training a model from the initial state, and it reduced the training time. Additionally, we trained the…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
