Do Pretrained Contextual Language Models Distinguish between Hebrew Homograph Analyses?

Avi Shmidman; Cheyn Shmuel Shmidman; Dan Bareket; Moshe Koppel; Reut Tsarfaty

arXiv:2405.07099·cs.CL·May 14, 2024

TL;DR

This paper investigates how well Hebrew pre-trained language models (PLMs) distinguish between homograph analyses. It finds that they are most effective at disambiguating segmentation and morphosyntactic features, less effective at pure word-sense disambiguation, and less effective overall as the level of ambiguity increases.

Contribution

The study provides a comprehensive evaluation of Hebrew PLMs on a novel homograph challenge set, highlighting their strengths and limitations in disambiguation tasks.

Findings

1. Hebrew PLMs outperform non-contextualized embeddings.

2. They are most effective for segmentation and morphosyntactic features.

3. Effectiveness decreases with higher ambiguity levels.

Abstract

Semitic morphologically-rich languages (MRLs) are characterized by extreme word ambiguity. Because most vowels are omitted in standard texts, many of the words are homographs with multiple possible analyses, each with a different pronunciation and different morphosyntactic properties. This ambiguity goes beyond word-sense disambiguation (WSD), and may include token segmentation into multiple word units. Previous research on MRLs claimed that standardly trained pre-trained language models (PLMs) based on word-pieces may not sufficiently capture the internal structure of such tokens in order to distinguish between these analyses. Taking Hebrew as a case study, we investigate the extent to which Hebrew homographs can be disambiguated and analyzed using PLMs. We evaluate all existing models for contextualized Hebrew embeddings on novel Hebrew homograph challenge sets that we deliver. Our…

Code & Models

Repositories

Dicta-Israel-Center-for-Text-Analysis/EACL_2023
Official