FlauBERT: Unsupervised Language Model Pre-training for French
Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab

TL;DR
FlauBERT is a large-scale unsupervised French language model that improves performance across various NLP tasks and is shared with a standardized evaluation protocol for reproducibility.
Contribution
The paper introduces FlauBERT, a new French language model trained on a large corpus, with multiple sizes and a unified evaluation protocol for French NLP tasks.
Findings
FlauBERT outperforms other pre-training approaches on most NLP tasks.
Different model sizes are effectively trained using the CNRS Jean Zay supercomputer.
A reproducible evaluation protocol (FLUE) is proposed for French NLP.
Abstract
Language models have become a key step to achieve state-of-the-art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
