Russian word sense induction by clustering averaged word embeddings

Andrey Kutuzov

arXiv:1805.02258 · cs.CL · May 8, 2018 · 1 citation

TL;DR

This paper presents a simple clustering-based approach using averaged word embeddings for Russian word sense induction, demonstrating competitive results and highlighting the effectiveness of small, balanced corpora for training embeddings.

Contribution

Introduces a naive clustering method with pre-trained embeddings for Russian word sense induction, showing small, balanced corpora can outperform larger noisy datasets.

Findings

1. Achieved 2nd place on the wiki-wiki dataset
2. Small, balanced corpora can outperform large noisy data in sense induction
3. Simple averaging and clustering can be effective for word sense tasks

Abstract

The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE-2018). Our team was ranked 2nd on the wiki-wiki dataset (containing mostly homonyms) and 5th on the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive. It consisted of representing the contexts of ambiguous words as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. These vector representations were then clustered with mainstream clustering techniques, producing groups that correspond to the senses of the ambiguous word. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data, not only in intrinsic evaluation but also in downstream tasks like word…
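The pipeline described in the abstract (average the embeddings of each context, then cluster the averaged vectors) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the toy 2-d vectors in `EMB` stand in for a pre-trained model, English words are used for readability instead of Russian, and plain k-means with a fixed number of clusters stands in for the "mainstream clustering techniques", which the abstract does not name. The function names `average_context`, `kmeans`, and `induce_senses` are hypothetical and are not taken from the paper's repository.

```python
# Toy 2-d vectors standing in for a pre-trained embedding model (hypothetical values).
EMB = {
    "river": [1.0, 0.0], "water": [0.9, 0.1], "shore": [0.8, 0.2],
    "money": [0.0, 1.0], "loan": [0.1, 0.9], "account": [0.2, 0.8],
}

def average_context(tokens, emb):
    """Average the vectors of the tokens found in emb; None if no token is known."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means; the first k points serve as initial centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

def induce_senses(contexts, emb, k=2):
    """Return one cluster label per context; -1 marks contexts with no known words."""
    vectors = [average_context(ctx, emb) for ctx in contexts]
    known = [i for i, v in enumerate(vectors) if v is not None]
    labels = [-1] * len(contexts)
    if len(known) >= k:
        for i, lab in zip(known, kmeans([vectors[i] for i in known], k)):
            labels[i] = lab
    return labels

# Four contexts of the ambiguous word "bank"; the target itself is absent
# from EMB, so it does not contribute to the averaged vector.
contexts = [
    ["river", "bank", "water"],
    ["bank", "loan", "money"],
    ["shore", "bank", "river"],
    ["account", "bank", "money"],
]
labels = induce_senses(contexts, EMB, k=2)
print(labels)  # [0, 1, 0, 1]
```

In this toy setup the two "river" contexts and the two "money" contexts fall into separate clusters. A real system would load actual pre-trained vectors and would likely need a clustering method that does not require the number of senses to be fixed in advance.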

Code & Models

Repositories

akutuzov/russian_wsi (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification