TL;DR
This study compares Russian word embedding models trained on a large web corpus (Araneum Russicum Maximum) and on the smaller, curated Russian National Corpus (RNC), analyzing their performance on a semantic similarity task and discussing issues in the evaluation dataset.
Contribution
It introduces a corrected version of the Russian part of the Multilingual SimLex999 dataset and provides a detailed comparison of how corpus size and quality affect word embedding performance.
Findings
The RNC yields more robust embeddings than the web corpus
The corrected dataset improves semantic similarity evaluation
Model performance varies across different parts of the dataset
Abstract
In this paper, we present a distributional word embedding model trained on one of the largest available Russian corpora: Araneum Russicum Maximum (over 10 billion words crawled from the web). We compare this model to a model trained on the Russian National Corpus (RNC). The two corpora differ greatly in size and compilation procedures. We test these differences by evaluating the trained models against the Russian part of the Multilingual SimLex999 semantic similarity dataset. We detect and describe numerous issues in this dataset and publish a new corrected version. Aside from the already known fact that the RNC is generally a better training corpus than web corpora, we enumerate and explain fine differences in how the models process the semantic similarity task, which parts of the evaluation set are difficult for particular models, and why. Additionally, the learning curves for…
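The evaluation described in the abstract is commonly done by correlating a model's cosine similarities with the human ratings in the dataset. The sketch below illustrates that general procedure in plain Python; it is not the authors' code. The toy vectors, the toy ratings, the function names, and the choice to skip pairs with out-of-vocabulary words are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, pairs):
    """Return (Spearman rho, number of skipped pairs).

    Assumption: pairs containing a word missing from the model are skipped.
    """
    model_scores, human_scores = [], []
    skipped = 0
    for word1, word2, rating in pairs:
        if word1 not in vectors or word2 not in vectors:
            skipped += 1
            continue
        model_scores.append(cosine(vectors[word1], vectors[word2]))
        human_scores.append(rating)
    return spearman(model_scores, human_scores), skipped


# Hypothetical toy embeddings and ratings, not taken from the paper.
vectors = {
    "кот": [1.0, 0.0],
    "кошка": [0.9, 0.1],
    "собака": [0.6, 0.4],
    "стол": [0.0, 1.0],
}
pairs = [
    ("кот", "кошка", 9.0),
    ("кот", "собака", 5.0),
    ("кот", "стол", 1.0),
    ("кот", "дом", 2.0),  # "дом" is out of vocabulary, so this pair is skipped
]

rho, skipped = evaluate(vectors, pairs)
print(rho, skipped)
```

On the toy data the model ranks the three covered pairs in the same order as the human ratings, so the correlation is close to 1.0. How out-of-vocabulary pairs are treated changes the score, so the skipped count is worth reporting alongside it when comparing models trained on corpora of very different sizes.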
