Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

Phuong-Hang Le; Valentin Pelloin; Arnault Chatelain; Maryem Bouziane; Mohammed Ghennai; Qianwen Guan; Kirill Milintsevich; Salima Mdhaffar; Aidan Mannion; Nils Defauw; Shuyue Gu; Alexandre Audibert; Marco Dinarelli; Yannick Estève; Lorraine Goeuriot; Steffen Lalande; Nicolas Hervé; Maximin Coavoux; François Portet; Étienne Ollion; Marie Candito; Maxime Peyrard; Solange Rossato; Benjamin Lecouteux; Aurélie Nardy; Gilles Sérasset; Vincent Segonne; Solène Evain; Diandra Fabre; Didier Schwab

arXiv:2601.05911·cs.CL·March 24, 2026

TL;DR

Pantagruel is a family of self-supervised encoders for French text and speech that share one architecture and learn by predicting contextualized target representations in the feature space rather than tokens or speech units. The models are reported to outperform strong French baselines on multiple downstream benchmarks.

Contribution

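The paper presents a new family of self-supervised models for French text and speech that share a single architecture across both modalities and learn in the feature space. It also introduces INA-100k, a 100,000-hour corpus of diverse French audio drawn from the broadcast archives of the Institut National de l'Audiovisuel (INA).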

Findings

01

Pantagruel models outperform strong French baselines on multiple benchmarks.

02

Unified architecture effectively handles both speech and text inputs.

03

Feature-space self-supervised learning proves effective for both French text and French speech tasks.

Abstract

We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR, and CroissantLLM for text, together with Multilingual LibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l'Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both…
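To make the idea of feature-space targets concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. It assumes a teacher/student setup in which a teacher sees the full input and produces contextualized target vectors, a student sees a masked input and regresses onto those vectors at the masked positions, and the teacher tracks the student by an exponential moving average. None of these details are stated in the abstract; the toy `encode` function and all names below are hypothetical.

```python
import random


def encode(weights, frames):
    """Toy 'encoder' (hypothetical): add the sequence mean to each frame
    so every output depends on context, then scale each dimension."""
    dim = len(weights)
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[weights[d] * (f[d] + mean[d]) for d in range(dim)] for f in frames]


def mask_frames(frames, masked_positions):
    """Replace masked positions with zero vectors
    (a stand-in for a learned mask embedding)."""
    return [
        [0.0] * len(f) if i in masked_positions else list(f)
        for i, f in enumerate(frames)
    ]


def feature_space_loss(student_w, teacher_w, frames, masked_positions):
    """Mean squared error between student outputs and teacher targets,
    computed only at the masked positions."""
    targets = encode(teacher_w, frames)  # teacher sees the full input
    preds = encode(student_w, mask_frames(frames, masked_positions))  # student sees masked input
    total, count = 0.0, 0
    for i in masked_positions:
        for p, t in zip(preds[i], targets[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


def ema_update(teacher_w, student_w, decay=0.99):
    """Move teacher weights toward student weights (assumed EMA teacher)."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher_w, student_w)]


if __name__ == "__main__":
    random.seed(0)
    # Frames can stand for token embeddings or speech features alike.
    frames = [[random.random() for _ in range(4)] for _ in range(6)]
    student = [1.0, 1.0, 1.0, 1.0]
    teacher = [1.0, 1.0, 1.0, 1.0]
    print("loss:", feature_space_loss(student, teacher, frames, {1, 3}))
    teacher = ema_update(teacher, student)
```

The point of the sketch is that the loss never refers to a vocabulary or a set of discrete speech units: the targets are continuous vectors, so the same objective applies to text and speech inputs once each is turned into a sequence of frames.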

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research