TL;DR
This paper introduces a character-wise binary classification method for Hebrew morphological segmentation that splits orthographic word forms without morphological analysis or disambiguation. It reaches over 98% accuracy on the SPMRL benchmark and 97% on a new out-of-domain Wikipedia dataset, about 4% and 5% above the previous state of the art.
Contribution
The character-wise approach, using adjacent-character and word-based lexicon-lookup features, improves Hebrew segmentation accuracy over the previous state of the art by roughly 4% on the SPMRL benchmark and 5% on out-of-domain Wikipedia data.
Findings
Achieved over 98% accuracy on the benchmark SPMRL shared task data for Hebrew
Achieved 97% accuracy on a new out-of-domain Wikipedia dataset
Improved on the previous state of the art by ~4% (SPMRL) and ~5% (Wikipedia)
Abstract
This paper presents a novel approach to the segmentation of orthographic word forms in contemporary Hebrew, focusing purely on splitting without carrying out morphological analysis or disambiguation. Casting the analysis task as character-wise binary classification and using adjacent character and word-based lexicon-lookup features, this approach achieves over 98% accuracy on the benchmark SPMRL shared task data for Hebrew, and 97% accuracy on a new out-of-domain Wikipedia dataset, improvements of ~4% and 5%, respectively, over previous state-of-the-art performance.
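To make the character-wise framing concrete, here is a minimal sketch of how a word could be turned into per-character labels and features. It is an illustration under stated assumptions, not the paper's implementation: the function names, the padding symbol, the window size, the toy lexicon, and the exact feature set are all hypothetical, and no classifier is trained here. Each character is labeled 1 if a segment boundary follows it and 0 otherwise, and its features combine neighboring characters with lexicon lookups on the substrings to either side of the candidate split.

```python
from typing import Dict, List, Set

PAD = "_"  # hypothetical padding symbol for positions outside the word

# Toy lexicon (an assumption for illustration only).
TOY_LEXICON: Set[str] = {"ה", "ו", "ב", "בית"}


def segments_to_labels(segments: List[str]) -> List[int]:
    """One label per character: 1 if a segment boundary follows it, else 0."""
    labels: List[int] = []
    for seg in segments:
        labels.extend([0] * (len(seg) - 1) + [1])
    labels[-1] = 0  # the end of the word is not a split point
    return labels


def labels_to_segments(word: str, labels: List[int]) -> List[str]:
    """Invert segments_to_labels: cut the word after every character labeled 1."""
    segments: List[str] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            segments.append(word[start:i + 1])
            start = i + 1
    if start < len(word):
        segments.append(word[start:])
    return segments


def char_features(word: str, i: int, lexicon: Set[str],
                  window: int = 2) -> Dict[str, object]:
    """Features for the decision 'is there a boundary after word[i]?'."""
    feats: Dict[str, object] = {"char": word[i]}
    for k in range(1, window + 1):
        # Adjacent characters, padded at the word edges.
        feats[f"prev{k}"] = word[i - k] if i - k >= 0 else PAD
        feats[f"next{k}"] = word[i + k] if i + k < len(word) else PAD
    # Lexicon lookups on the substrings left and right of the candidate split.
    feats["left_in_lexicon"] = word[:i + 1] in lexicon
    feats["right_in_lexicon"] = word[i + 1:] in lexicon
    return feats


word = "הבית"              # "the house" = ה + בית
gold = ["ה", "בית"]
labels = segments_to_labels(gold)
features = [char_features(word, i, TOY_LEXICON) for i in range(len(word))]
```

In a full system, the `features` and `labels` pairs for many words would train any binary classifier, and `labels_to_segments` would convert its predictions back into segments.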
