Pantagruel: Unified Self-Supervised Encoders for French Text and Speech
Phuong-Hang Le, Valentin Pelloin, Arnault Chatelain, Maryem Bouziane, Mohammed Ghennai, Qianwen Guan, Kirill Milintsevich, Salima Mdhaffar, Aidan Mannion, Nils Defauw, Shuyue Gu, Alexandre Audibert, Marco Dinarelli, Yannick Estève, Lorraine Goeuriot, Steffen Lalande

TL;DR
Pantagruel is a family of self-supervised encoders for French text and speech that share one architecture and learn by predicting contextualized targets in the feature space rather than tokens or speech units. The models outperform strong French baselines on multiple benchmarks.
Contribution
The paper presents a new family of self-supervised models for French text and speech that learn in the feature space, using a shared architecture for both modalities. It also introduces INA-100k, a 100,000-hour corpus of French broadcast audio drawn from the archives of the Institut National de l'Audiovisuel (INA).
Findings
Pantagruel models outperform strong French baselines on multiple benchmarks.
Unified architecture effectively handles both speech and text inputs.
Feature-space self-supervised learning proves effective for French multimodal tasks.
Abstract
We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l'Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both…
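To make the phrase "contextualized target representations in the feature space" concrete, here is a minimal sketch of one way such an objective can be set up. The abstract does not spell out the training procedure, so everything below is an assumption made for illustration: a teacher-student arrangement in which the teacher encodes the unmasked input, the student encodes a masked copy, and the loss is the mean squared error between the two feature sequences at masked positions. The function names `masked_feature_loss` and `ema_update` are hypothetical and do not come from the paper or its code.

```python
from typing import List, Sequence

Features = List[List[float]]  # one feature vector per time step (token or audio frame)


def masked_feature_loss(student: Features, teacher: Features, mask: Sequence[bool]) -> float:
    """Mean squared error between student and teacher features at masked steps.

    Assumption for illustration: the targets are continuous teacher features,
    not token ids or discrete speech units.
    """
    total, count = 0.0, 0
    for s_vec, t_vec, is_masked in zip(student, teacher, mask):
        if not is_masked:
            continue  # only masked positions contribute to the loss
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("mask selects no positions")
    return total / count


def ema_update(teacher_w: Sequence[float], student_w: Sequence[float], decay: float) -> List[float]:
    """Assumed teacher update: exponential moving average of the student weights."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher_w, student_w)]


# Toy example: three time steps with 2-dimensional features, middle step masked.
teacher_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
student_feats = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]]
mask = [False, True, False]

loss = masked_feature_loss(student_feats, teacher_feats, mask)
new_teacher = ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9)
```

Because the targets in this sketch are plain vectors, the same loss applies whether the time steps are subword tokens or audio frames, which is one reason a feature-space objective fits a shared architecture across modalities.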
Code & Models
- 🤗 PantagrueLLM/text-base-camtok-wiki · model · 1 dl
- 🤗 PantagrueLLM/text-base-camtok-oscar · model · 7 dl
- 🤗 PantagrueLLM/text-base-wiki · model · 3 dl
- 🤗 PantagrueLLM/speech-base-14K · model · 33 dl
- 🤗 PantagrueLLM/speech-base-1K · model · 2 dl
- 🤗 PantagrueLLM/speech-large-114K · model · 98 dl
- 🤗 PantagrueLLM/speech-large-14K · model · 255 dl
- 🤗 PantagrueLLM/text-base-croissant-mlm-old · model · 57 dl
- 🤗 PantagrueLLM/text-base-wiki-mlm · model · 53 dl
- 🤗 PantagrueLLM/Text_Base_FR_croissant · model · 248 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
