SuperSim: a test set for word similarity and relatedness in Swedish

Simon Hengchen; Nina Tahmasebi

arXiv:2104.05228·cs.CL·April 13, 2021·5 cites

Open Access

TL;DR

SuperSim is a Swedish word similarity and relatedness test set of 1,360 word pairs built from expert human judgments. It supports evaluation of models such as Word2Vec, fastText, and GloVe on Swedish.

Contribution

The paper introduces SuperSim, a large-scale Swedish word similarity and relatedness test set with expert annotations, and provides baseline evaluations of Word2Vec, fastText, and GloVe.

Findings

01

SuperSim contains 1,360 word pairs, each independently judged for both relatedness and similarity by five annotators.

02

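Baseline Word2Vec, fastText, and GloVe models, each trained on the Swedish Gigaword corpus and on a Swedish Wikipedia dump, show varying performance on the test set.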

03

The fully annotated test set, code, baseline models, and data are publicly released for future research.

Abstract

Language models are notoriously difficult to evaluate. We release SuperSim, a large-scale similarity and relatedness test set for Swedish built with expert human judgments. The test set is composed of 1,360 word-pairs independently judged for both relatedness and similarity by five annotators. We evaluate three different models (Word2Vec, fastText, and GloVe) trained on two separate Swedish datasets, namely the Swedish Gigaword corpus and a Swedish Wikipedia dump, to provide a baseline for future comparison. We release the fully annotated test set, code, baseline models, and data.
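A test set of this kind is conventionally used by correlating a model's similarity scores with the human judgments. The sketch below illustrates that idea under several assumptions that are not taken from the abstract: Spearman rank correlation as the metric, the mean of the five annotator scores as the gold value, cosine similarity as the model score, and skipping of out-of-vocabulary pairs. The word vectors, the word pairs, and the 0 to 10 scale are invented toy data, not SuperSim content.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n in ascending order; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, gold_pairs):
    """Return (rho, number of pairs scored), skipping out-of-vocabulary pairs."""
    model_scores, gold_scores = [], []
    for word1, word2, annotator_scores in gold_pairs:
        if word1 in vectors and word2 in vectors:
            model_scores.append(cosine(vectors[word1], vectors[word2]))
            # Assumption: the gold score is the mean over annotators.
            gold_scores.append(sum(annotator_scores) / len(annotator_scores))
    return spearman(model_scores, gold_scores), len(model_scores)


# Hypothetical toy embeddings; a real run would load trained model vectors.
toy_vectors = {
    "katt": [1.0, 0.1, 0.0],
    "hund": [0.9, 0.2, 0.0],
    "bil": [0.0, 1.0, 0.1],
    "buss": [0.1, 0.9, 0.2],
    "äpple": [0.0, 0.1, 1.0],
}

# Hypothetical pairs with five annotator scores each (assumed 0-10 scale).
toy_pairs = [
    ("katt", "hund", [8, 9, 8, 7, 9]),
    ("bil", "buss", [7, 8, 7, 8, 7]),
    ("katt", "bil", [1, 2, 1, 0, 1]),
    ("hund", "äpple", [0, 1, 0, 1, 0]),
    ("katt", "zebra", [2, 3, 2, 2, 3]),  # "zebra" has no vector: skipped
]

rho, covered = evaluate(toy_vectors, toy_pairs)
print(f"Spearman rho = {rho:.3f} over {covered} of {len(toy_pairs)} pairs")
```

In this toy data the model and the annotators rank the four covered pairs in the same order, so the correlation is perfect. Since each SuperSim pair is judged for both relatedness and similarity, the same procedure would be run once per judgment type.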

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: fastText