In the LLM era, Word Sense Induction remains unsolved
Anna Mosolova, Marie Candito, Carlos Ramisch

TL;DR
This paper evaluates word sense induction (WSI) methods, including LLM-based approaches, on a new SemCor-derived dataset. Data augmentation and lexicons help, but no unsupervised method beats the "one cluster per lemma" heuristic, so the task remains unsolved.
Contribution
The paper introduces a SemCor-derived evaluation dataset that preserves the original corpus polysemy and frequency distributions, proposes and evaluates an LLM-based WSI method for English, and analyzes the impact of data augmentation and lexicons.
Findings
No unsupervised method outperforms the 'one cluster per lemma' heuristic.
Results vary across parts of speech.
Data augmentation and lexicons improve WSI performance.
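To make the first finding concrete, here is a minimal sketch of the "one cluster per lemma" heuristic scored on toy data. The instance ids, sense labels, and the choice of B-cubed F1 are assumptions made for illustration; this summary does not state which metric or data format the paper uses.

```python
from collections import defaultdict

def one_cluster_per_lemma(instances):
    """Put every occurrence of a lemma into a single cluster named after it."""
    return {inst_id: lemma for inst_id, lemma in instances}

def bcubed_f1(pred, gold):
    """B-cubed F1 between two labelings of the same instance ids."""
    ids = list(gold)
    pred_members = defaultdict(set)
    gold_members = defaultdict(set)
    for i in ids:
        pred_members[pred[i]].add(i)
        gold_members[gold[i]].add(i)
    precision = recall = 0.0
    for i in ids:
        p = pred_members[pred[i]]   # instances sharing i's predicted cluster
        g = gold_members[gold[i]]   # instances sharing i's gold sense
        overlap = len(p & g)
        precision += overlap / len(p)
        recall += overlap / len(g)
    precision /= len(ids)
    recall /= len(ids)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy data: (instance id, lemma) pairs and made-up gold senses.
instances = [("b1", "bank"), ("b2", "bank"), ("b3", "bank"), ("b4", "bank"),
             ("p1", "plant"), ("p2", "plant")]
gold = {"b1": "bank.finance", "b2": "bank.finance", "b3": "bank.finance",
        "b4": "bank.river", "p1": "plant.factory", "p2": "plant.factory"}

pred = one_cluster_per_lemma(instances)
score = bcubed_f1(pred, gold)
print(round(score, 3))
```

On this toy set the heuristic never splits a gold sense, so recall is perfect, and it loses precision only on the single minority-sense occurrence of "bank". That is one plausible reason such a baseline is hard to beat when a lemma's occurrences are dominated by one sense.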
Abstract
In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints, number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong "one cluster per lemma" heuristic (1cpl). We also show that (i) results and best…
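As an illustration of what a must-link constraint means in this setting, the sketch below merges predicted clusters after the fact so that every constrained pair of occurrences ends up together. This is an assumption for illustration only: the summary does not say how the paper derives or enforces its constraints, and the function name, instance ids, and labels are hypothetical.

```python
def apply_must_link(labels, must_link):
    """Merge predicted clusters so each must-link pair shares one cluster.

    labels: dict mapping instance id -> cluster id
    must_link: iterable of (instance id, instance id) pairs
    """
    parent = {c: c for c in set(labels.values())}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in must_link:
        ra, rb = find(labels[a]), find(labels[b])
        if ra != rb:
            parent[rb] = ra  # merge the two clusters
    return {i: find(c) for i, c in labels.items()}

# Hypothetical clustering of four occurrences of one lemma.
labels = {"s1": 0, "s2": 1, "s3": 1, "s4": 2}
# Assumed constraint: s1 and s2 are known to share a sense.
merged = apply_must_link(labels, [("s1", "s2")])
print(merged)
```

Merging after clustering is the simplest way to honor such constraints. Constrained clustering algorithms instead take them into account while the clusters are being formed.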
Taxonomy
Topics: Natural Language Processing Techniques · ICT in Developing Communities · Language and cultural evolution
