HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

Noam Kayzer; Dan Revital; Ori Bar Joseph; Smadar Arvatz; Or Levi; Tal Geva; Shaltiel Shmidman; Amir DN Cohen; Noam Ordan; Omer Baruch; Kate Zinkovskaia; Zevi Apini; Sarel Weinberger

arXiv:2605.11255·cs.CL·May 13, 2026

TL;DR

Hebatron is a Hebrew-specialized open-weight large language model built on a sparse Mixture-of-Experts architecture. It is trained with a three-phase easy-to-hard curriculum and offers high inference throughput and native long-context support up to 65,536 tokens.

Contribution

First open-weight Hebrew-specific MoE model built on the Nemotron-3 architecture, with native long-context support and an easy-to-hard curriculum training methodology.

Findings

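01 Achieves a Hebrew reasoning average of 73.8%

02 Outperforms DictaLM-3.0-24B-Thinking (68.9%) and remains competitive with Gemma-3-27B-IT on GSM8K-HE and Israeli Trivia

03 Delivers approximately 9 times higher inference throughput at native context lengths up to 65,536 tokens, while activating only 3B of its 30B parameters per forward pass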

Abstract

We present Hebatron, a Hebrew-specialized open-weight large language model built on the NVIDIA Nemotron-3 sparse Mixture-of-Experts architecture. Training employs a three-phase easy-to-hard curriculum with continuous anti-forgetting anchoring, followed by supervised fine-tuning on 2 million bilingual Hebrew-English samples. The curriculum ordering alone yields a 3-point aggregate benchmark gain over the reversed configuration. Hebatron achieves a Hebrew reasoning average of 73.8%, outperforming DictaLM-3.0-24B-Thinking (68.9%) and remaining competitive with Gemma-3-27B-IT on GSM8K-HE and Israeli Trivia, while activating only 3B parameters per forward pass across a 30B-parameter model, delivering approximately 9 times higher inference throughput at native context lengths up to 65,536 tokens. To our knowledge, this is the first language-specific adaptation of the Nemotron-3…
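The abstract names a three-phase easy-to-hard curriculum with continuous anti-forgetting anchoring but does not spell out the mechanics here. The sketch below is one plausible reading, not the authors' implementation: samples are sorted by a difficulty score, split into three contiguous phases, and each phase is mixed with a fixed share of "anchor" samples so earlier capabilities keep being rehearsed. The function name `build_curriculum`, the per-sample difficulty score, the `anchor_pool`, and the `anchor_fraction` value are all assumptions introduced for illustration. The `reverse` flag mimics the reversed (hard-to-easy) configuration the abstract compares against.

```python
import random


def build_curriculum(samples, anchor_pool, num_phases=3,
                     anchor_fraction=0.1, reverse=False, seed=0):
    """Hypothetical curriculum builder (illustrative only).

    samples:         list of (text, difficulty) pairs; difficulty is an
                     assumed numeric score, lower means easier.
    anchor_pool:     list of texts replayed in every phase as an assumed
                     form of anti-forgetting anchoring.
    anchor_fraction: assumed share of each phase made up of anchor texts
                     (must be in [0, 1)).
    reverse:         if True, order phases hard-to-easy instead.
    Returns a list of num_phases lists of texts.
    """
    rng = random.Random(seed)
    ordered = sorted(samples, key=lambda s: s[1], reverse=reverse)
    phase_size = -(-len(ordered) // num_phases)  # ceiling division

    phases = []
    for p in range(num_phases):
        chunk = [text for text, _ in
                 ordered[p * phase_size:(p + 1) * phase_size]]
        # Pick enough anchors that they form anchor_fraction of the phase.
        n_anchor = round(len(chunk) * anchor_fraction / (1 - anchor_fraction))
        anchors = [rng.choice(anchor_pool) for _ in range(n_anchor)]
        phase = chunk + anchors
        rng.shuffle(phase)  # interleave anchors with curriculum samples
        phases.append(phase)
    return phases


# Toy data: nine Hebrew samples with difficulty 0..8, two anchor texts.
toy_samples = [(f"he_{i}", i) for i in range(9)]
toy_anchors = ["anchor_a", "anchor_b"]

easy_to_hard = build_curriculum(toy_samples, toy_anchors, anchor_fraction=0.25)
hard_to_easy = build_curriculum(toy_samples, toy_anchors, anchor_fraction=0.25,
                                reverse=True)
```

In this toy setup each phase holds three curriculum samples plus one anchor. How Hebatron actually scores difficulty, what data serves as the anchor, and what mixing ratio is used are not stated in the abstract.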
