Why Not Simply Translate? A First Swedish Evaluation Benchmark for Semantic Similarity

Tim Isbister; Magnus Sahlgren

arXiv:2009.03116 · cs.CL · December 1, 2020 · 6 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper introduces the first Swedish benchmark for semantic similarity, built by machine-translating the English STS-B dataset; evaluates a range of Swedish text representations on it; and discusses the limitations of this simple approach.

Contribution

The paper provides the first Swedish semantic similarity benchmark and uses it to compare most of the existing Swedish text representations, showing that native Swedish models outperform multilingual ones.

Findings

01. Native Swedish models outperform multilingual models.
02. Simple bag of words performs surprisingly well.
03. Translation-based dataset has inherent limitations.

Abstract

This paper presents the first Swedish evaluation benchmark for textual semantic similarity. The benchmark is compiled by simply running the English STS-B dataset through the Google machine translation API. This paper discusses potential problems with using such a simple approach to compile a Swedish evaluation benchmark, including translation errors, vocabulary variation, and productive compounding. Despite some obvious problems with the resulting dataset, we use the benchmark to compare the majority of the currently existing Swedish text representations, demonstrating that native models outperform multilingual ones, and that simple bag of words performs remarkably well.
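To make the evaluation setup concrete, here is a minimal sketch of how a bag-of-words baseline could be scored on an STS-style benchmark. It is not the authors' code: the whitespace tokenizer, the use of Pearson correlation, and the three Swedish sentence pairs with their 0 to 5 gold scores are all assumptions made up for illustration and are not taken from the dataset.

```python
import math
from collections import Counter


def bow_vector(sentence: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens (assumed tokenizer)."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson(xs, ys) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


# Hypothetical toy pairs (sentence 1, sentence 2, gold similarity on a 0-5 scale).
# These are invented for this sketch, NOT rows from the benchmark.
pairs = [
    ("en man spelar gitarr", "en man spelar gitarr", 5.0),
    ("en man spelar gitarr", "en kvinna spelar piano", 2.0),
    ("en hund springer i parken", "katten sover", 0.0),
]

scores = [cosine(bow_vector(s1), bow_vector(s2)) for s1, s2, _ in pairs]
gold = [g for _, _, g in pairs]
r = pearson(scores, gold)
print(f"Pearson correlation between model scores and gold scores: {r:.3f}")

# Compounding illustration: an exact-match bag of words sees no overlap
# between a compound and its separately written parts.
compound_overlap = cosine(bow_vector("fotbollsspelare"), bow_vector("fotboll spelare"))
```

The last line hints at why productive compounding matters for such a baseline: with exact token matching, a compound written as one word shares no tokens with the same parts written separately, so the similarity is zero regardless of meaning.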

Code & Models

Repositories

timpal0l/sts-benchmark-swedish
Official

Datasets

timpal0l/stsb_mt_sv
dataset · 26 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification