A French Version of the OLDI Seed Corpus

Malik Marmonier; Benoît Sagot; Rachel Bawden

arXiv:2508.02290 · cs.CL · August 5, 2025

TL;DR

This paper introduces the first French version of the OLDI Seed Corpus, created through machine translation and expert post-editing, aimed at supporting resource development for regional languages of France.

Contribution

The paper details how the French partition of the OLDI Seed Corpus was created, describes the translation challenges posed by the source data, and positions the corpus as a pivot resource for the under-resourced regional languages of France.

Findings

01. Successfully created a French seed corpus for OLDI

02. Identified translation challenges with technical and user-generated content

03. Provides a resource to support regional language NLP development

Abstract

We present the first French partition of the OLDI Seed Corpus, our submission to the WMT 2025 Open Language Data Initiative (OLDI) shared task. We detail its creation process, which involved using multiple machine translation systems and a custom-built interface for post-editing by qualified native speakers. We also highlight the unique translation challenges presented by the source data, which combines highly technical, encyclopedic terminology with the stylistic irregularities characteristic of user-generated content taken from Wikipedia. This French corpus is not an end in itself, but is intended as a crucial pivot resource to facilitate the collection of parallel corpora for the under-resourced regional languages of France.

Taxonomy

Topics: Natural Language Processing Techniques · Linguistics and Terminology Studies · Language and Cultural Evolution