SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods

Roksana Goworek; Harpal Karlcut; Muhammad Shezad; Nijaguna Darshana; Abhishek Mane; Syam Bondada; Raghav Sikka; Ulvi Mammadov; Rauf Allahverdiyev; Sriram Purighella; Paridhi Gupta; Muhinyia Ndegwa; Haim Dubossarsky

arXiv:2505.23714·cs.CL·July 23, 2025

TL;DR

This paper introduces new sense-annotated datasets for ten low-resource languages, built with a semi-automatic annotation method, and uses them in WiC-formatted experiments to evaluate cross-lingual transfer on polysemy disambiguation.

Contribution

The paper releases new sense-annotated datasets for ten low-resource languages and presents a semi-automatic annotation approach that facilitates dataset creation.

Findings

01. Datasets enable better cross-lingual transfer evaluation.

02. Semi-automatic annotation improves efficiency and quality.

03. Results show the importance of targeted datasets for polysemy disambiguation.

Abstract

This paper addresses the critical need for high-quality evaluation datasets in low-resource languages to advance cross-lingual transfer. While cross-lingual transfer offers a key strategy for leveraging multilingual pretraining to expand language technologies to understudied and typologically diverse languages, its effectiveness depends on high-quality, suitable benchmarks. We release new sense-annotated datasets of sentences containing polysemous words, spanning ten low-resource languages across diverse language families and scripts. To facilitate dataset creation, we present a demonstrably beneficial semi-automatic annotation method. The utility of the datasets is demonstrated through Word-in-Context (WiC) formatted experiments that evaluate transfer on these low-resource languages. Results highlight the importance of targeted dataset creation and evaluation for effective…
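To illustrate what "WiC formatted" means in general, here is a minimal Python sketch that turns sense-annotated sentences into Word-in-Context pairs: two sentences containing the same target word, labelled by whether the word carries the same sense in both. The record layout, the `build_wic_pairs` helper, the English example sentences, and the sense identifiers are all hypothetical and chosen for illustration. They are not taken from the paper, which may construct and sample its pairs differently.

```python
from itertools import combinations
from typing import NamedTuple


class SenseExample(NamedTuple):
    lemma: str      # target polysemous word
    sentence: str   # sentence containing the target word
    sense_id: str   # annotated sense label (hypothetical format)


class WiCPair(NamedTuple):
    lemma: str
    sentence1: str
    sentence2: str
    label: bool     # True if both sentences use the same sense


def build_wic_pairs(examples):
    """Pair every two sentences that share a lemma; label by sense agreement."""
    by_lemma = {}
    for ex in examples:
        by_lemma.setdefault(ex.lemma, []).append(ex)

    pairs = []
    for lemma, group in by_lemma.items():
        for a, b in combinations(group, 2):
            pairs.append(
                WiCPair(lemma, a.sentence, b.sentence, a.sense_id == b.sense_id)
            )
    return pairs


# Illustrative data only; not drawn from the released datasets.
examples = [
    SenseExample("bank", "She sat on the bank of the river.", "bank.river"),
    SenseExample("bank", "He deposited cash at the bank.", "bank.finance"),
    SenseExample("bank", "The bank approved the loan.", "bank.finance"),
]

pairs = build_wic_pairs(examples)
for p in pairs:
    print(p.label, "|", p.sentence1, "|", p.sentence2)
```

Exhaustive pairing grows quadratically with the number of sentences per word, so a real pipeline would likely subsample and balance the positive and negative pairs. That step is omitted here.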

Taxonomy

Topics: ICT in Developing Communities · Natural Language Processing Techniques · Multilingual Education and Policy