In the LLM era, Word Sense Induction remains unsolved

Anna Mosolova; Marie Candito; Carlos Ramisch

arXiv:2603.11686·cs.CL·March 13, 2026

TL;DR

This paper evaluates word sense induction methods, including LLM-based approaches, on a new dataset, revealing current limitations and the potential of data augmentation and lexicons, but showing the task remains unsolved.

Contribution

It introduces a new evaluation dataset respecting corpus polysemy, assesses LLM-based WSI methods, and analyzes the impact of data augmentation and lexicons, highlighting current challenges.

Findings

01. No unsupervised method outperforms the 'one cluster per lemma' heuristic.

02. Results vary across parts of speech.

03. Data augmentation and lexicons improve WSI performance.
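
To make the first finding concrete, here is a minimal sketch of what a "one cluster per lemma" baseline does and why it can be hard to beat. The scoring function is B-cubed F1, which is an assumption on our part: this page does not say which metric the paper uses. The toy instances and sense labels are hypothetical and are not taken from the paper's SemCor-derived dataset.

```python
from collections import Counter


def one_cluster_per_lemma(instances):
    """Put every occurrence of a lemma into a single cluster named after the lemma."""
    return {inst_id: lemma for inst_id, lemma, _sense in instances}


def bcubed_f1(gold, pred):
    """B-cubed F1 for two labelings given as dicts: instance id -> label."""
    ids = list(gold)
    gold_sizes = Counter(gold[i] for i in ids)
    pred_sizes = Counter(pred[i] for i in ids)
    joint_sizes = Counter((gold[i], pred[i]) for i in ids)
    # Per-instance precision: share of the predicted cluster with the same gold sense.
    precision = sum(
        joint_sizes[gold[i], pred[i]] / pred_sizes[pred[i]] for i in ids
    ) / len(ids)
    # Per-instance recall: share of the gold sense that landed in the same predicted cluster.
    recall = sum(
        joint_sizes[gold[i], pred[i]] / gold_sizes[gold[i]] for i in ids
    ) / len(ids)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: (instance id, lemma, gold sense label).
# "bank" is skewed toward one sense; "run" is monosemous in this tiny corpus.
instances = [
    ("bank.1", "bank", "bank:institution"),
    ("bank.2", "bank", "bank:institution"),
    ("bank.3", "bank", "bank:institution"),
    ("bank.4", "bank", "bank:river"),
    ("run.1", "run", "run:move"),
    ("run.2", "run", "run:move"),
]

gold = {inst_id: sense for inst_id, _lemma, sense in instances}
pred = one_cluster_per_lemma(instances)
score = bcubed_f1(gold, pred)
print(round(score, 3))
```

On this toy input the baseline never splits a gold sense, so recall is perfect, and precision is only penalized on the polysemous lemma, giving a score of about 0.857. When most occurrences of a lemma share one dominant sense, as in a corpus that keeps its natural frequency distribution, a clustering method has to split very accurately to do better than not splitting at all.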

Abstract

In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints, and the number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong "one cluster per lemma" heuristic (1cpl). We also show that (i) results and best…

Taxonomy

Topics: Natural Language Processing Techniques · ICT in Developing Communities · Language and cultural evolution