Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All

Eylon Gueta; Avi Shmidman; Shaltiel Shmidman; Cheyn Shmuel Shmidman; Joshua Guedalia; Moshe Koppel; Dan Bareket; Amit Seker; Reut Tsarfaty

arXiv:2211.15199·cs.CL·May 17, 2023·5 cites

Open Access 4 Models

TL;DR

This paper introduces AlephBERTGimmel, a Hebrew language model with a 128K-item vocabulary, much larger than that of earlier Hebrew PLMs. It shows that larger vocabularies improve task performance, and the model achieves state-of-the-art results on all available Hebrew benchmarks.

Contribution

The paper presents a new Hebrew PLM with a significantly larger vocabulary and provides a contrastive analysis showing its advantages over previous models.

Findings

01

Larger vocabularies reduce token splits and improve performance.

02

AlephBERTGimmel achieves a new state of the art on all available Hebrew benchmarks.

03

Reducing token splits benefits model accuracy.

Abstract

We present a new pre-trained language model (PLM) for modern Hebrew, termed AlephBERTGimmel, which employs a much larger vocabulary (128K items) than previous standard Hebrew PLMs. We perform a contrastive analysis of this model against all previous Hebrew PLMs (mBERT, heBERT, AlephBERT) and assess the effects of larger vocabularies on task performance. Our experiments show that larger vocabularies lead to fewer splits, and that reducing splits improves model performance across different tasks. All in all, this new model achieves a new SOTA on all available Hebrew benchmarks, including Morphological Segmentation, POS Tagging, Full Morphological Analysis, NER, and Sentiment Analysis. Consequently, we advocate for PLMs that are larger not only in terms of number of layers or training data, but also in terms of their vocabulary. We release the new model publicly for unrestricted use.
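To make the notion of "splits" concrete, the following is a minimal sketch of greedy longest-match WordPiece segmentation, the general scheme used by BERT-style tokenizers, applied with a small and a larger vocabulary. It is not the authors' tokenizer or evaluation code: the function names `wordpiece` and `splits_per_word`, the toy Latin-script words, and both vocabularies are hypothetical and chosen only for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces.

    Non-initial pieces carry the conventional "##" prefix.
    Returns [unk] if some part of the word cannot be matched.
    """
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def splits_per_word(words, vocab):
    """Average number of pieces per word; 1.0 means no word is split."""
    return sum(len(wordpiece(w, vocab)) for w in words) / len(words)


# Hypothetical toy data (assumption, not taken from the paper).
words = ["token", "tokens", "tokenizer"]
small_vocab = {"tok", "##en", "##s", "##izer"}
large_vocab = small_vocab | {"token", "tokens", "tokenizer"}

small_avg = splits_per_word(words, small_vocab)
large_avg = splits_per_word(words, large_vocab)
print(f"small vocabulary: {small_avg:.2f} pieces per word")
print(f"large vocabulary: {large_avg:.2f} pieces per word")
```

In this toy setting the larger vocabulary stores each word whole, so the average number of pieces per word drops to 1.0. The sketch only illustrates the mechanism behind "fewer splits"; it says nothing about downstream accuracy, which the paper measures empirically.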

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification