SuperSim: a test set for word similarity and relatedness in Swedish
Simon Hengchen, Nina Tahmasebi

TL;DR
SuperSim is a comprehensive Swedish word similarity and relatedness test set created with expert judgments, enabling better evaluation of language models like Word2Vec, fastText, and GloVe on Swedish language tasks.
Contribution
The paper introduces SuperSim, the first large-scale Swedish word similarity and relatedness test set with expert annotations, and provides baseline evaluations of popular models.
Findings
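SuperSim contains 1,360 word pairs, each judged independently for both relatedness and similarity by five annotators.
Performance varies across the three baseline models (Word2Vec, fastText, and GloVe), each trained on the Swedish Gigaword corpus and on a Swedish Wikipedia dump.
The fully annotated test set, code, baseline models, and data are publicly released for future research.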
Abstract
Language models are notoriously difficult to evaluate. We release SuperSim, a large-scale similarity and relatedness test set for Swedish built with expert human judgments. The test set is composed of 1,360 word-pairs independently judged for both relatedness and similarity by five annotators. We evaluate three different models (Word2Vec, fastText, and GloVe) trained on two separate Swedish datasets, namely the Swedish Gigaword corpus and a Swedish Wikipedia dump, to provide a baseline for future comparison. We release the fully annotated test set, code, baseline models, and data.
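A test set of this kind is typically used by comparing a model's similarity score for each word pair against the averaged human judgments. The sketch below illustrates that idea under several assumptions not stated in the abstract: Spearman rank correlation as the metric, cosine similarity as the model score, the mean of the five annotator scores as the gold value, a 0-10 rating scale, and skipping of out-of-vocabulary pairs. The vectors and scores are made-up toy values, not SuperSim data, and the function names are hypothetical.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def evaluate(vectors, pairs):
    """Correlate model similarities with mean annotator scores.

    Pairs with a word missing from `vectors` are skipped (an assumption).
    Returns (correlation, number of pairs actually used).
    """
    gold, predicted = [], []
    for w1, w2, scores in pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(mean(scores))
            predicted.append(cosine(vectors[w1], vectors[w2]))
    return spearman(gold, predicted), len(gold)


# Toy embeddings and annotator scores, invented for illustration only.
toy_vectors = {
    "katt": [1.0, 0.9, 0.0],
    "hund": [0.9, 1.0, 0.1],
    "bil": [0.0, 0.1, 1.0],
    "buss": [0.1, 0.0, 0.9],
}
toy_pairs = [
    ("katt", "hund", [8, 9, 8, 7, 9]),
    ("bil", "buss", [7, 8, 7, 8, 7]),
    ("katt", "bil", [1, 0, 2, 1, 1]),
    ("hund", "buss", [0, 1, 0, 1, 0]),
    ("katt", "älg", [3, 2, 3, 2, 3]),  # "älg" has no vector, so it is skipped
]

rho, n_used = evaluate(toy_vectors, toy_pairs)
print(f"Spearman correlation: {rho:.2f} over {n_used} pairs")
```

Because the test set carries separate similarity and relatedness judgments, the same routine could be run once per judgment type, giving two correlations per model and training corpus.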
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: fastText
