HalleluBERT: Let every token that has meaning bear its weight

Raphael Scheible-Schmitt

arXiv:2510.21372·cs.CL·October 27, 2025

HalleluBERT: Let every token that has meaning bear its weight

Raphael Scheible-Schmitt

PDF

2 Models

TL;DR

HalleluBERT is a new Hebrew-specific RoBERTa model trained on a large corpus, outperforming existing models on NLP benchmarks and establishing a new state of the art for Hebrew language understanding.

Contribution

It introduces HalleluBERT, a fully trained monolingual Hebrew RoBERTa model with a large corpus and Hebrew-specific vocabulary, improving NLP performance.

Findings

01

Outperforms existing Hebrew models on NER and sentiment tasks.

02

Sets new state of the art for Hebrew NLP.

03

Demonstrates benefits of fully converged monolingual pretraining.

Abstract

Transformer-based models have advanced NLP, yet Hebrew still lacks a large-scale RoBERTa encoder which is extensively trained. Existing models such as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or training depth. We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1~GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary. Evaluated on NER and sentiment classification benchmarks, HalleluBERT outperforms both monolingual and multilingual baselines. HalleluBERT sets a new state of the art for Hebrew and highlights the benefits of fully converged monolingual pretraining.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Models

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.