FlauBERT: Unsupervised Language Model Pre-training for French

Hang Le; Loïc Vial; Jibril Frej; Vincent Segonne; Maximin Coavoux; Benjamin Lecouteux; Alexandre Allauzen; Benoît Crabbé; Laurent Besacier; Didier Schwab

arXiv:1912.05372·cs.CL·March 16, 2020·61 cites

TL;DR

FlauBERT is a large-scale unsupervised French language model that improves performance across various NLP tasks and is shared with a standardized evaluation protocol for reproducibility.

Contribution

The paper introduces FlauBERT, a new French language model trained on a large corpus, with multiple sizes and a unified evaluation protocol for French NLP tasks.

Findings

01

FlauBERT outperforms other pre-training approaches on most NLP tasks.

02

Models of several sizes are trained on the CNRS Jean Zay supercomputer.

03

A reproducible evaluation protocol (FLUE) is proposed for French NLP.

Abstract

Language models have become a key step to achieve state-of-the-art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text…

Code & Models

Datasets

GETALP/flue
dataset · 109 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification