Beyond Word Boundaries: A Hebrew Coreference Benchmark and an Evaluation Protocol for Morphologically Complex Text
Refael Shaked Greenfeld, Reut Tsarfaty

TL;DR
This paper introduces KibutzR, a Hebrew coreference resolution dataset and evaluation protocol tailored for morphologically complex languages, revealing performance gaps in current models and highlighting the need for segmentation-aware approaches.
Contribution
It provides the first comprehensive Hebrew CR dataset with multi-level mention annotations and a new evaluation protocol addressing boundary discrepancies in MRLs.
Findings
LLMs perform worse on Hebrew than English
Performance drops on raw unsegmented Hebrew text
Smaller encoders outperform larger decoder models in Hebrew
Abstract
Coreference Resolution (CR) is a fundamental NLP task critical for long-form tasks as information extraction, summarization, and many business applications. However, CR methods originally designed for English struggle with Morphologically Rich Languages (MRLs), where mention boundaries do not necessarily align with word boundaries, and a single token may consist of multiple anaphors. CR modeling and evaluation protocols standardly assume that, as in English, words and mentions mostly align. However, this assumption breaks down in MRLs, particularly in the context of LLMs' raw-text processing and end-to-end tasks. To assess and address this challenge, we introduce {\em KibutzR}, the first comprehensive CR dataset for Modern Hebrew, an MRL rich with complex words and pronominal clitics. We deliver an annotated dataset that identifies mentions at word, sub-word and multi-word levels, and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
