TL;DR
This paper investigates how well Hebrew pre-trained language models (PLMs) distinguish homographs. The models excel at segmentation and morphosyntactic analysis but are less effective at pure word-sense disambiguation, especially as ambiguity increases.
Contribution
The study provides a comprehensive evaluation of Hebrew PLMs on a novel homograph challenge set, highlighting their strengths and limitations in disambiguation tasks.
Findings
Hebrew PLMs outperform non-contextualized embeddings.
They are most effective for segmentation and morphosyntactic features.
Effectiveness decreases with higher ambiguity levels.
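To illustrate why a non-contextualized embedding is at a disadvantage here, the following is a minimal toy sketch, not the paper's method or data. It assumes a single unvocalized form (written `spr`, loosely modeled on Hebrew ספר, which can be read as "book" or "barber"), hand-made English context tokens, a one-hot "static" vector per surface form, and a bag-of-context average as a crude stand-in for a contextualized embedding. All names, sentences, and labels are hypothetical.

```python
from collections import defaultdict

# Toy data (hypothetical): (sentence tokens, index of the homograph, gold reading).
TRAIN = [
    (["read", "spr", "library"], 1, "book"),
    (["wrote", "spr", "library"], 1, "book"),
    (["spr", "hair", "scissors"], 0, "barber"),
    (["spr", "cut", "hair"], 0, "barber"),
]
TEST = [
    (["wrote", "spr", "read"], 1, "book"),
    (["spr", "scissors", "cut"], 0, "barber"),
]
VOCAB = sorted({tok for sent, _, _ in TRAIN + TEST for tok in sent})


def static_vector(sent, idx):
    """One vector per surface form (one-hot); the context is ignored."""
    return [1.0 if tok == sent[idx] else 0.0 for tok in VOCAB]


def contextual_vector(sent, idx):
    """Crude stand-in for a contextualized embedding:
    the average of the one-hot vectors of every token in the sentence."""
    return [sent.count(tok) / len(sent) for tok in VOCAB]


def fit_centroids(examples, embed):
    """Average the vectors of each reading to get one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for sent, idx, label in examples:
        vec = embed(sent, idx)
        if label not in sums:
            sums[label] = [0.0] * len(VOCAB)
        sums[label] = [a + b for a, b in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [v / counts[lab] for v in total] for lab, total in sums.items()}


def predict(centroids, vec):
    """Nearest-centroid prediction; ties resolve to the alphabetically first label."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vec))
    return min(sorted(centroids), key=sq_dist)


def evaluate(embed):
    centroids = fit_centroids(TRAIN, embed)
    preds = [predict(centroids, embed(sent, idx)) for sent, idx, _ in TEST]
    gold = [label for _, _, label in TEST]
    accuracy = sum(p == g for p, g in zip(preds, gold)) / len(gold)
    return preds, accuracy


static_preds, static_acc = evaluate(static_vector)
context_preds, context_acc = evaluate(contextual_vector)
print("static:    ", static_preds, static_acc)
print("contextual:", context_preds, context_acc)
```

Because the static vector is identical for every occurrence of `spr`, both class centroids coincide and the classifier can only return one reading for all test items. The context-aware vector separates the two readings in this toy setting; real PLM embeddings and the paper's challenge set are far richer than this sketch.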
Abstract
Semitic morphologically rich languages (MRLs) are characterized by extreme word ambiguity. Because most vowels are omitted in standard texts, many of the words are homographs with multiple possible analyses, each with a different pronunciation and different morphosyntactic properties. This ambiguity goes beyond word-sense disambiguation (WSD), and may include token segmentation into multiple word units. Previous research on MRLs claimed that standardly trained pre-trained language models (PLMs) based on word-pieces may not sufficiently capture the internal structure of such tokens in order to distinguish between these analyses. Taking Hebrew as a case study, we investigate the extent to which Hebrew homographs can be disambiguated and analyzed using PLMs. We evaluate all existing models for contextualized Hebrew embeddings on a novel Hebrew homograph challenge set that we deliver. Our…
