CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

Wissam Antoun; Francis Kulumba; Rian Touchent; Éric de la Clergerie; Benoît Sagot; Djamé Seddah

arXiv:2411.08868·cs.CL·November 14, 2024

TL;DR

This paper introduces CamemBERT 2.0, two updated French language models designed to counter temporal concept drift. They are trained on larger, more recent data with updated architectures and improved tokenization, and they improve performance on both general and domain-specific NLP tasks.

Contribution

The paper presents two new versions of CamemBERT: CamemBERTav2, based on the DeBERTaV3 architecture, and CamemBERTv2, based on RoBERTa. Both are trained on larger, more recent data with improved tokenization to address concept drift in French NLP models.

Findings

01. Models outperform previous versions on NLP benchmarks.

02. Enhanced performance in medical and general-domain tasks.

03. Openly available on Hugging Face for community use.

Abstract

French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with models like CamemBERT seeing over 4 million downloads per month. However, these models face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issue emphasizes the need for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use of the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked Language Modeling (MLM) objective. Both models are trained on a…
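The two pretraining objectives named in the abstract differ mainly in what the model is asked to predict. The toy sketch below contrasts how training targets could be built for each. It is an illustration under simplifying assumptions, not the authors' pipeline: the function names, the 15% rate, and the `[MASK]` string are hypothetical choices, real MLM typically mixes masking with random and unchanged tokens, and real RTD samples replacements from a small generator model rather than uniformly from a vocabulary.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol for this sketch


def mlm_example(tokens, mask_prob=0.15, rng=None):
    """Toy MLM targets: masked positions are labeled with the original
    token; all other positions get None (no loss would be computed)."""
    if rng is None:
        rng = random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def rtd_example(tokens, vocab, replace_prob=0.15, rng=None):
    """Toy RTD targets: some tokens are swapped for a random vocabulary
    item (a stand-in for generator samples, which is an assumption here).
    Every position gets a binary label: 1 if replaced, 0 if original."""
    if rng is None:
        rng = random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        new_tok = rng.choice(vocab) if rng.random() < replace_prob else tok
        corrupted.append(new_tok)
        # A sampled token that happens to equal the original counts as original.
        labels.append(int(new_tok != tok))
    return corrupted, labels


if __name__ == "__main__":
    sentence = "le fromage est vieux".split()
    print(mlm_example(sentence, mask_prob=0.5))
    print(rtd_example(sentence, ["pain", "vin", "lait"], replace_prob=0.5))
```

In this sketch, MLM produces a learning signal only at the masked positions, while RTD produces a binary label at every position.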

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Web Data Mining and Analysis

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Multi-Head Attention · Linear Warmup With Linear Decay · Layer Normalization · WordPiece · Dropout · Adam