Multilingual Coreference Resolution in Low-resource South Asian Languages
Ritwik Mishra, Pooja Desur, Rajiv Ratn Shah, Ponnurangam Kumaraguru

TL;DR
This paper introduces TransMuCoRes, a translated multilingual coreference resolution dataset covering 31 South Asian languages, and evaluates coreference models on Hindi. It covers dataset creation, model performance, and the limitations of current evaluation metrics.
Contribution
It presents the first end-to-end coreference resolution evaluation on Hindi and introduces a new multilingual dataset for low-resource South Asian languages.
Findings
The best model achieved an LEA F1 of 64 and a CoNLL F1 of 68 on Hindi.
Nearly all predicted translations passed a sanity check, and 75% of English references aligned with their predicted translations.
Current evaluation metrics have limitations for datasets with split antecedents.
Abstract
Coreference resolution involves the task of identifying text spans within a discourse that pertain to the same real-world entity. While this task has been extensively explored in the English language, there has been a notable scarcity of publicly accessible resources and models for coreference resolution in South Asian languages. We introduce a Translated dataset for Multilingual Coreference Resolution (TransMuCoRes) in 31 South Asian languages using off-the-shelf tools for translation and word-alignment. Nearly all of the predicted translations successfully pass a sanity check, and 75% of English references align with their predicted translations. Using multilingual encoders, two off-the-shelf coreference resolution models were trained on a concatenation of TransMuCoRes and a Hindi coreference resolution dataset with manual annotations. The best performing model achieved a score of 64…
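The abstract says the dataset was built with off-the-shelf translation and word-alignment tools, but does not spell out how mention spans are carried over to the translated text. The sketch below shows one common way this kind of annotation projection can be done, under stated assumptions: the alignment is given in Pharaoh-style `i-j` pairs (source token index, target token index), and a mention is projected to the smallest target span covering all of its aligned tokens. The function names `parse_pharaoh` and `project_span` are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple

# (source_token_index, target_token_index); this format is an assumption.
Alignment = List[Tuple[int, int]]
Span = Tuple[int, int]  # inclusive (start, end) token indices


def parse_pharaoh(line: str) -> Alignment:
    """Parse Pharaoh-style pairs such as '0-0 1-3 2-1' into integer tuples."""
    pairs = []
    for pair in line.split():
        src, tgt = pair.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def project_span(span: Span, alignment: Alignment) -> Optional[Span]:
    """Project a source mention span onto the target sentence.

    Returns the smallest target span covering every target token aligned
    to a source token inside the mention, or None if nothing is aligned.
    """
    start, end = span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None  # the mention cannot be projected
    return min(targets), max(targets)


# Toy alignment for a four-token source sentence (illustrative only).
alignment = parse_pharaoh("0-0 1-3 2-1 3-2")

# Two coreferent source mentions at tokens 0 and 2.
source_cluster = [(0, 0), (2, 2)]
projected_cluster = [project_span(m, alignment) for m in source_cluster]
```

A mention whose tokens have no alignment comes back as `None`, so a pipeline of this shape has to decide whether to drop such mentions or the whole sentence. How TransMuCoRes handles that case is not stated in this summary.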
Taxonomy
Topics: Natural Language Processing Techniques
Methods: ALIGN
