A Characterwise Windowed Approach to Hebrew Morphological Segmentation

Amir Zeldes

arXiv:1808.07214 · cs.CL · August 30, 2018

TL;DR

This paper introduces a character-wise binary classification method for Hebrew morphological segmentation. Without performing morphological analysis or disambiguation, it reaches over 98% accuracy on the SPMRL benchmark data and 97% on a new out-of-domain Wikipedia dataset, roughly 4% and 5% above the previous state of the art.

Contribution

The novel character-wise approach with lexicon features improves Hebrew segmentation accuracy by approximately 4-5% over previous state-of-the-art methods.

Findings

01 Achieved over 98% accuracy on benchmark data

02 Achieved 97% accuracy on out-of-domain Wikipedia data

03 Improved performance by approximately 4-5% over prior methods

Abstract

This paper presents a novel approach to the segmentation of orthographic word forms in contemporary Hebrew, focusing purely on splitting without carrying out morphological analysis or disambiguation. Casting the analysis task as character-wise binary classification and using adjacent character and word-based lexicon-lookup features, this approach achieves over 98% accuracy on the benchmark SPMRL shared task data for Hebrew, and 97% accuracy on a new out-of-domain Wikipedia dataset, improvements of ~4% and 5%, respectively, over previous state-of-the-art performance.
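To make the framing in the abstract concrete, the following is a minimal sketch of how segmentation can be cast as one binary decision per character, with a window of adjacent characters and lexicon-lookup flags as features. It is an illustration under stated assumptions, not the paper's or RFTokenizer's implementation: the function names, the padding symbol, the window size, the two lexicon features, and the transliterated example word are all hypothetical, and the classifier itself is left out.

```python
from typing import Dict, List, Set

PAD = "_"  # hypothetical padding symbol for positions outside the word


def boundary_labels(segments: List[str]) -> List[int]:
    """One label per character: 1 if a segment boundary follows that character.

    Assumes every segment is non-empty.
    """
    labels: List[int] = []
    for seg in segments:
        labels.extend([0] * (len(seg) - 1) + [1])
    labels[-1] = 0  # the end of the word is not a split point
    return labels


def char_features(word: str, i: int, lexicon: Set[str], window: int = 2) -> Dict[str, object]:
    """Features for the decision "split after position i?" (illustrative feature set)."""
    feats: Dict[str, object] = {}
    # Adjacent characters inside a fixed window around position i.
    for off in range(-window, window + 1):
        j = i + off
        feats[f"char[{off:+d}]"] = word[j] if 0 <= j < len(word) else PAD
    # Word-based lexicon lookups on the two substrings a split here would create.
    feats["left_in_lexicon"] = word[: i + 1] in lexicon
    feats["right_in_lexicon"] = word[i + 1:] in lexicon
    return feats


def apply_labels(word: str, labels: List[int]) -> List[str]:
    """Turn per-character 0/1 decisions back into a list of segments."""
    segments: List[str] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            segments.append(word[start: i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


# Illustrative transliterated word split into three segments (assumed example).
gold = ["w", "b", "byt"]
word = "".join(gold)
labels = boundary_labels(gold)
features = [char_features(word, i, lexicon={"byt"}, window=1) for i in range(len(word))]
```

In a setup like this, each `features[i]` would be vectorized and paired with `labels[i]` to train any off-the-shelf binary classifier; at prediction time, `apply_labels` reassembles the segments from the predicted labels.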

Code & Models

Repositories

amir-zeldes/RFTokenizer
Official
