Size vs. Structure in Training Corpora for Word Embedding Models: Araneum Russicum Maximum and Russian National Corpus

Andrey Kutuzov; Maria Kunilovskaya

arXiv:1801.06407·cs.CL·January 22, 2018

TL;DR

This study compares Russian word embedding models trained on a large web corpus (Araneum Russicum Maximum) and on a smaller, curated national corpus (the RNC). It evaluates both on a semantic similarity task and documents issues in the evaluation dataset.

Contribution

The paper publishes a corrected version of the Russian part of the Multilingual SimLex999 dataset and gives a detailed comparison of how corpus size and compilation quality affect word embedding performance.

Findings

1. The RNC yields more robust embeddings than the web corpus.
2. The corrected dataset improves semantic similarity evaluation.
3. Model performance varies across different parts of the dataset.

Abstract

In this paper, we present a distributional word embedding model trained on one of the largest available Russian corpora: Araneum Russicum Maximum (over 10 billion words crawled from the web). We compare this model to a model trained on the Russian National Corpus (RNC). The two corpora differ greatly in size and compilation procedures. We test these differences by evaluating the trained models against the Russian part of the Multilingual SimLex999 semantic similarity dataset. We detect and describe numerous issues in this dataset and publish a new corrected version. Aside from the already known fact that the RNC is generally a better training corpus than web corpora, we enumerate and explain fine differences in how the models handle the semantic similarity task, which parts of the evaluation set are difficult for particular models, and why. Additionally, the learning curves for…
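The evaluation described in the abstract, scoring a model against a SimLex999-style list of word pairs with human similarity ratings, can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the abstract does not name the metric, so the sketch assumes the convention commonly used with SimLex999, namely Spearman rank correlation between the model's cosine similarities and the human ratings. The toy vectors, the toy word pairs, and the helper names (`cosine`, `average_ranks`, `spearman`, `evaluate`) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def evaluate(vectors, gold_pairs):
    """Return (rho, skipped): correlation over covered pairs and the
    number of pairs dropped because a word is missing from the model."""
    model_scores, human_scores = [], []
    skipped = 0
    for w1, w2, score in gold_pairs:
        if w1 not in vectors or w2 not in vectors:
            skipped += 1
            continue
        model_scores.append(cosine(vectors[w1], vectors[w2]))
        human_scores.append(score)
    return spearman(model_scores, human_scores), skipped


# Hypothetical toy data standing in for a trained model and a gold set.
toy_vectors = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "table": [0.0, 1.0],
    "chair": [0.3, 0.9],
}
toy_gold = [
    ("cat", "kitten", 8.5),
    ("table", "chair", 7.0),
    ("cat", "table", 1.0),
    ("cat", "unicorn", 3.0),  # "unicorn" is out of vocabulary
]

rho, skipped = evaluate(toy_vectors, toy_gold)
print(f"Spearman rho = {rho:.3f}, skipped pairs = {skipped}")
```

Reporting the number of skipped pairs alongside the correlation matters when comparing models trained on corpora of very different sizes, since vocabulary coverage of the evaluation set can differ between them.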

Code & Models

Repositories

natasha/navec (PyTorch)
