FRENCH-YMCA: A FRENCH Corpus meeting the language needs of Youth, froM Children to Adolescents
Cherifa Ben Khelil, Jean-Yves Antoine, Anaïs Halftermeyer, Frédéric Rayar, Mathieu Thebaud

TL;DR
The paper introduces the French-YMCA corpus, a large, diverse, and accessible linguistic resource designed to support language models tailored for children and adolescents.
Contribution
It presents a new extensive French corpus specifically designed for youth, addressing their unique language needs and supporting age-appropriate language model development.
Findings
Contains 39,200 text files and 22.47 million words.
Features diverse sources with consistent grammar and spelling.
Open online access for research and development.
Abstract
In this paper, we introduce the French-YMCA corpus, a new linguistic resource specifically tailored for children and adolescents. The motivation for building this corpus is clear: children have unique language requirements, as their language skills are in constant evolution and differ from those of adults. With an extensive collection of 39,200 text files, the French-YMCA corpus encompasses a total of 22,471,898 words. It distinguishes itself through its diverse sources, consistent grammar and spelling, and the commitment to providing open online accessibility for all. Such a corpus can serve as the foundation for training language models that understand and anticipate youth's language, thereby enhancing the quality of digital interactions and ensuring that responses and suggestions are age-appropriate and adapted to the comprehension level of users of this age.
