Gaperon: A Peppered English-French Generative Language Model Suite
Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, Éric de la Clergerie, Benoît Sagot, Djamé Seddah

TL;DR
Gaperon is an open suite of French-English language models with extensive training data and tools, enabling research on data quality, contamination, safety, and model performance trade-offs.
Contribution
It introduces a comprehensive, open framework for training multilingual language models with detailed data curation, safety testing, and transparency.
Findings
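Filtering for linguistic quality improves text fluency and coherence but yields subpar benchmark results.
Late deliberate contamination (continuing training on data mixes that include test sets) recovers competitive benchmark scores while only moderately harming generation quality.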
Open release supports research on safety and data curation trade-offs.
Abstract
We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination -- continuing training on data mixes that include test sets -- recovers competitive scores while only reasonably harming generation…
Code & Models
- 🤗 almanach/Gaperon-1125-8B (model) · 123 dl
- 🤗 almanach/Gaperon-1125-8B-SFT (model) · 53 dl · ♡ 2
- 🤗 almanach/Gaperon-1125-1B-SFT (model) · 719 dl · ♡ 1
- 🤗 almanach/Gaperon-1125-1B (model) · 38 dl · ♡ 2
- 🤗 almanach/Gaperon-1125-24B (model) · 4 dl · ♡ 1
- 🤗 almanach/Gaperon-1125-24B-SFT (model) · 6 dl · ♡ 1
- 🤗 almanach/Gaperon-Young-1125-1B (model) · ♡ 1
- 🤗 almanach/Gaperon-Garlic-1125-1B (model)
- 🤗 almanach/Gaperon-Garlic-1125-8B (model)
- 🤗 almanach/Gaperon-Garlic-1125-24B (model) · ♡ 1
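
The repositories listed above follow a regular naming pattern, so a checkpoint can be selected programmatically. The sketch below is a minimal illustration, not the authors' documented usage: the helper names `gaperon_repo_id` and `load_gaperon` are hypothetical, and it assumes the checkpoints load through the generic Hugging Face `transformers` auto classes, which this page does not confirm.

```python
from typing import Optional

# Repository ids copied from the list above.
GAPERON_REPOS = {
    "almanach/Gaperon-1125-1B",
    "almanach/Gaperon-1125-1B-SFT",
    "almanach/Gaperon-1125-8B",
    "almanach/Gaperon-1125-8B-SFT",
    "almanach/Gaperon-1125-24B",
    "almanach/Gaperon-1125-24B-SFT",
    "almanach/Gaperon-Young-1125-1B",
    "almanach/Gaperon-Garlic-1125-1B",
    "almanach/Gaperon-Garlic-1125-8B",
    "almanach/Gaperon-Garlic-1125-24B",
}


def gaperon_repo_id(size: str, variant: Optional[str] = None, sft: bool = False) -> str:
    """Build a repo id from the naming pattern and check it against the list.

    Hypothetical helper: `size` is "1B", "8B" or "24B"; `variant` is None,
    "Young" or "Garlic"; `sft` selects the "-SFT" repository.
    """
    family = "Gaperon" if variant is None else f"Gaperon-{variant}"
    repo_id = f"almanach/{family}-1125-{size}"
    if sft:
        repo_id += "-SFT"
    if repo_id not in GAPERON_REPOS:
        raise ValueError(f"{repo_id} is not in the list of released repositories")
    return repo_id


def load_gaperon(repo_id: str):
    """Load tokenizer and model (assumption: the auto classes support these repos)."""
    # Imported lazily so gaperon_repo_id works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


if __name__ == "__main__":
    # Downloads weights from the Hugging Face Hub; needs transformers and PyTorch.
    tokenizer, model = load_gaperon(gaperon_repo_id("1B"))
    inputs = tokenizer("Le fromage préféré des", return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Validating against the explicit list keeps the helper from producing ids for combinations that are not listed here, such as a Young 8B model.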
