CJaFr-v3: A Freely Available Filtered Japanese-French Aligned Corpus
Raoul Blin, Fabien Cromières

TL;DR
This paper introduces CJaFr-v3, a large, high-quality Japanese-French parallel corpus created by compiling and filtering existing resources, and demonstrates its usefulness for machine translation tasks.
Contribution
The paper provides the first freely available, filtered Japanese-French parallel corpus of 15 million segments, along with evaluation results showing its effectiveness for MT.
Findings
The corpus improves MT system performance.
Filtering enhances data quality.
The resource is freely accessible for research.
Abstract
We present a free Japanese-French parallel corpus. It includes 15M aligned segments and is obtained by compiling and filtering several existing resources. In this paper, we describe the existing resources, their quantity and quality, the filtering we applied to improve the quality of the corpus, and the content of the ready-to-use corpus. We also evaluate the usefulness of this corpus and the quality of our filtering by training and evaluating some standard MT systems with it.
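The abstract does not say which filters were applied, so the following is only a minimal sketch of the kind of filtering commonly used on compiled parallel data, not the authors' method. The helper names `normalize` and `filter_pairs`, the character-length-ratio test, and the threshold values are all assumptions made for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def filter_pairs(pairs, min_ratio=0.5, max_ratio=6.0):
    """Return cleaned (ja, fr) pairs in their original order.

    Drops pairs with an empty side, pairs whose French/Japanese
    character-length ratio falls outside [min_ratio, max_ratio],
    and exact duplicates. The thresholds are illustrative guesses,
    not values taken from the paper.
    """
    seen = set()
    kept = []
    for ja, fr in pairs:
        ja, fr = normalize(ja), normalize(fr)
        if not ja or not fr:
            continue  # one side is empty after normalization
        ratio = len(fr) / len(ja)
        if not (min_ratio <= ratio <= max_ratio):
            continue  # lengths too mismatched to be a plausible alignment
        if (ja, fr) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((ja, fr))
        kept.append((ja, fr))
    return kept


if __name__ == "__main__":
    sample = [
        ("こんにちは", "Bonjour"),
        ("こんにちは", "Bonjour"),
        ("", "Vide"),
        ("ありがとう  ", " Merci "),
    ]
    for ja, fr in filter_pairs(sample):
        print(ja, "|", fr)
```

A real pipeline would likely add language identification and alignment scoring on top of such surface checks; those steps are omitted here because the source gives no detail about them.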
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
