CJaFr-v3 : A Freely Available Filtered Japanese-French Aligned Corpus

Raoul Blin; Fabien Cromières

arXiv:2208.13170·cs.CL·August 30, 2022

Open Access

TL;DR

This paper introduces CJaFr-v3, a freely available Japanese-French parallel corpus of 15M aligned segments built by compiling and filtering existing resources, and evaluates its usefulness by training standard machine translation (MT) systems on it.

Contribution

The paper provides a freely available, filtered Japanese-French parallel corpus of 15 million aligned segments, together with an MT-based evaluation of both the corpus and the filtering.

Findings

01. The corpus improves MT system performance.

02. Filtering enhances data quality.

03. The resource is freely accessible for research.

Abstract

We present a free Japanese-French parallel corpus. It includes 15M aligned segments and is obtained by compiling and filtering several existing resources. In this paper, we describe the existing resources, their quantity and quality, the filtering we applied to improve the quality of the corpus, and the content of the ready-to-use corpus. We also evaluate the usefulness of this corpus and the quality of our filtering by training and evaluating some standard MT systems with it.
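
The abstract does not say which filters were applied, so the following is only a minimal sketch of the kind of filtering commonly applied to parallel corpora, not the authors' pipeline. It assumes pairs arrive as `(japanese, french)` string tuples; the function names, the script check, and the `max_ratio` threshold are all hypothetical choices made for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize and collapse whitespace so near-identical segments compare equal."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def has_japanese(text: str) -> bool:
    """Return True if the text contains hiragana, katakana, or a CJK ideograph."""
    return any(
        "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"
        for ch in text
    )


def filter_pairs(pairs, max_ratio=4.0):
    """Keep plausible (ja, fr) pairs. Thresholds are illustrative assumptions."""
    seen = set()
    kept = []
    for ja, fr in pairs:
        ja, fr = normalize(ja), normalize(fr)
        if not ja or not fr:
            continue  # one side is empty
        if not has_japanese(ja) or has_japanese(fr):
            continue  # wrong script on either side
        ratio = len(fr) / len(ja)  # character-length ratio, French over Japanese
        if ratio > max_ratio or ratio < 1 / max_ratio:
            continue  # lengths too different to be a likely translation
        if (ja, fr) in seen:
            continue  # exact duplicate after normalization
        seen.add((ja, fr))
        kept.append((ja, fr))
    return kept
```

Each heuristic here is cheap and language-agnostic apart from the script check; a production filter would typically add language identification and an alignment-quality score, which are beyond what the abstract describes.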

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis