TL;DR
HalleluBERT is a new Hebrew-specific RoBERTa model trained on a large corpus, outperforming existing models on NLP benchmarks and establishing a new state of the art for Hebrew language understanding.
Contribution
It introduces HalleluBERT, a fully trained monolingual Hebrew RoBERTa model with a large corpus and Hebrew-specific vocabulary, improving NLP performance.
Findings
Outperforms existing Hebrew models on NER and sentiment tasks.
Sets new state of the art for Hebrew NLP.
Demonstrates benefits of fully converged monolingual pretraining.
Abstract
Transformer-based models have advanced NLP, yet Hebrew still lacks a large-scale RoBERTa encoder which is extensively trained. Existing models such as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or training depth. We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1~GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary. Evaluated on NER and sentiment classification benchmarks, HalleluBERT outperforms both monolingual and multilingual baselines. HalleluBERT sets a new state of the art for Hebrew and highlights the benefits of fully converged monolingual pretraining.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
