MenakBERT -- Hebrew Diacriticizer

Ido Cohen; Jacob Gidron; Idan Pinto

arXiv:2410.02417·cs.CL·October 4, 2024

Open Access

TL;DR

MenakBERT is a character-level transformer model designed to automatically add diacritical marks to Hebrew text, improving upon existing methods by leveraging recent pretraining techniques.

Contribution

This paper introduces MenakBERT, a novel Hebrew diacritization model based on a character-level transformer pretrained on Hebrew text.

Findings

01

MenakBERT outperforms traditional diacritization systems.

02

Fine-tuning MenakBERT improves part-of-speech tagging accuracy.

03

The model demonstrates effective transfer learning for related NLP tasks.

Abstract

Diacritical marks in the Hebrew language give words their vocalized form. The task of adding diacritical marks to plain Hebrew text is still dominated by a system that relies heavily on human-curated resources. Recent models trained on diacritized Hebrew texts still present a gap in performance. We use a recently developed character-based pretrained language model (PLM) to narrowly bridge this gap. We present MenakBERT, a character-level transformer pretrained on Hebrew text and fine-tuned to produce diacritical marks for Hebrew sentences. We further show how fine-tuning a model for diacritization transfers to a task such as part-of-speech tagging.
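To make the task framing concrete, the sketch below shows one way diacritization can be posed as per-character labeling: a vocalized string is split into its base characters and the marks attached to each one, so that a character-level model could be trained to predict the marks from the bare text. This is an illustration only, not the paper's pipeline. The helper names `is_diacritic`, `split_labels`, and `apply_labels` are hypothetical, and treating every nonspacing mark in U+05B0..U+05C7 as a diacritic is an assumption of this sketch rather than the label scheme MenakBERT uses.

```python
import unicodedata


def is_diacritic(ch: str) -> bool:
    # Assumption: Hebrew points (vowel signs, dagesh, shin/sin dots) are the
    # nonspacing marks (category "Mn") in U+05B0..U+05C7. Punctuation in that
    # range, such as maqaf (U+05BE), has a different category and is excluded.
    return "\u05b0" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn"


def split_labels(vocalized: str) -> tuple[str, list[str]]:
    """Split vocalized text into bare text and one label per bare character.

    labels[i] holds the (possibly empty) string of marks that followed
    plain[i] in the input.
    """
    chars: list[str] = []
    labels: list[str] = []
    for ch in vocalized:
        if is_diacritic(ch):
            if chars:  # a mark with no preceding base character is dropped
                labels[-1] += ch
        else:
            chars.append(ch)
            labels.append("")
    return "".join(chars), labels


def apply_labels(plain: str, labels: list[str]) -> str:
    """Inverse of split_labels: reattach each label to its base character."""
    return "".join(c + m for c, m in zip(plain, labels))


# "shalom" written with escapes: shin + shin dot + qamats, lamed,
# vav + holam, final mem.
vocalized = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
plain, labels = split_labels(vocalized)
restored = apply_labels(plain, labels)
```

In this framing the model input is `plain` and the training target is `labels`; a real system would typically map each label to a class index, possibly with separate prediction heads for vowel signs, dagesh, and shin/sin dots.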

Taxonomy

Topics: Natural Language Processing Techniques