MenakBERT -- Hebrew Diacriticizer
Ido Cohen, Jacob Gidron, Idan Pinto

TL;DR
MenakBERT is a character-level transformer model that automatically adds diacritical marks to Hebrew text, improving on recent trained models by building on a recently developed character-based pretrained language model.
Contribution
This paper introduces MenakBERT, a novel Hebrew diacritization model based on a character-level transformer pretrained on Hebrew text.
Findings
MenakBERT narrowly bridges the performance gap between recent models trained on diacritized Hebrew text and the dominant system, which relies heavily on human-curated resources.
Fine-tuning the model for diacritization transfers to a related NLP task, part-of-speech tagging.
Abstract
Diacritical marks in the Hebrew language give words their vocalized form. The task of adding diacritical marks to plain Hebrew text is still dominated by a system that relies heavily on human-curated resources. Recent models trained on diacritized Hebrew texts still present a gap in performance. We use a recently developed char-based PLM to narrowly bridge this gap, presenting MenakBERT, a character-level transformer pretrained on Hebrew text and fine-tuned to produce diacritical marks for Hebrew sentences. We go on to show how fine-tuning a model for diacritization transfers to a task such as part-of-speech tagging.
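One way to picture diacritization at the character level is as a labeling problem: each base character of the plain text receives a label made of the marks that follow it. The sketch below is an illustration under that assumption, not the authors' pipeline. The helper names split_diacritics and apply_diacritics are hypothetical, and the label scheme (one string of marks per character) is chosen only for simplicity. It relies on the standard-library unicodedata module, where Hebrew vowel points, the dagesh, and the shin/sin dots carry the category "Mn" (nonspacing mark).

```python
import unicodedata


def split_diacritics(text):
    """Split diacritized text into plain characters and per-character mark labels.

    Hypothetical helper for illustration; the label scheme is an assumption.
    """
    base_chars = []
    labels = []
    # NFD puts combining marks into a canonical order after their base character.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn" and base_chars:
            # A nonspacing mark attaches to the most recent base character.
            labels[-1] += ch
        else:
            base_chars.append(ch)
            labels.append("")
    return "".join(base_chars), labels


def apply_diacritics(plain, labels):
    """Inverse step: re-attach one (possibly empty) label to each plain character."""
    return "".join(ch + marks for ch, marks in zip(plain, labels))


# The word "shalom" with vowel points, written with escapes to avoid
# bidirectional-text display issues: shin, shin dot, qamats, lamed, vav, holam, final mem.
word = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
plain, labels = split_diacritics(word)
restored = apply_diacritics(plain, labels)
```

Under this framing, a character-level model would read plain and predict labels, and apply_diacritics would turn its predictions back into diacritized text. How MenakBERT actually structures its output labels is not stated in the abstract.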
Taxonomy
Topics: Natural Language Processing Techniques
