Word Similarity Datasets for Thai: Construction and Evaluation

Ponrudee Netisopakul; Gerhard Wohlgenannt; Aleksei Pulich

arXiv:1904.04307·cs.CL·April 10, 2019

TL;DR

This paper introduces three Thai word similarity datasets derived from popular English datasets, enabling better evaluation of Thai word embeddings and highlighting challenges like out-of-vocabulary issues.

Contribution

The creation and release of three Thai word similarity datasets based on translation and re-rating of established English datasets, with baseline evaluations and a tool for model assessment.

Findings

1. High out-of-vocabulary rate in Thai embeddings
2. Datasets cover various difficulty levels and domains
3. Baseline evaluations demonstrate current model limitations

Abstract

Distributional semantics in the form of word embeddings is an essential ingredient of many modern natural language processing systems. The quantification of semantic similarity between words can be used to evaluate the ability of a system to perform semantic interpretation. To this end, a number of word similarity datasets have been created for the English language over the last decades. For the Thai language, few such resources are available. In this work, we create three Thai word similarity datasets by translating and re-rating the popular WordSim-353, SimLex-999 and SemEval-2017-Task-2 datasets. The three datasets contain 1852 word pairs in total and have different characteristics in terms of difficulty, domain coverage, and notion of similarity (relatedness vs. similarity). These features help to gain a broader picture of the properties of an evaluated word embedding model. We include…
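Word similarity datasets of this kind are commonly used by correlating a model's cosine similarities with the human ratings and reporting how many pairs had to be skipped because a word was out of vocabulary. The following is a minimal sketch of that general procedure, not the authors' evaluation tool: the function names, the toy vectors, and the toy ratings are all hypothetical, and skipping out-of-vocabulary pairs is only one possible policy.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else float("nan")

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def evaluate(pairs, vectors):
    """Return (Spearman's rho over in-vocabulary pairs, out-of-vocabulary rate).

    pairs:   list of (word1, word2, human_rating)
    vectors: dict mapping word -> list of floats
    Pairs with a missing word are skipped and counted as OOV (an assumption).
    """
    gold, predicted, oov = [], [], 0
    for w1, w2, rating in pairs:
        if w1 not in vectors or w2 not in vectors:
            oov += 1
            continue
        gold.append(rating)
        predicted.append(cosine(vectors[w1], vectors[w2]))
    rho = spearman(gold, predicted) if len(gold) > 1 else float("nan")
    return rho, oov / len(pairs)

# Toy data for illustration only; not taken from the released datasets.
toy_vectors = {
    "แมว": [1.0, 0.0],    # cat
    "สุนัข": [0.9, 0.1],   # dog
    "รถ": [0.0, 1.0],     # car
}
toy_pairs = [
    ("แมว", "สุนัข", 8.0),
    ("แมว", "รถ", 1.5),
    ("สุนัข", "รถ", 2.0),
    ("แมว", "เครื่องบิน", 1.0),  # "airplane" is missing from toy_vectors
]

rho, oov_rate = evaluate(toy_pairs, toy_vectors)
print(f"Spearman rho: {rho:.3f}, OOV rate: {oov_rate:.2%}")
```

Reporting the out-of-vocabulary rate alongside the correlation matters because a score computed on only the covered pairs can look better than the model's behaviour on the full dataset.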

Code & Models

Repositories

gwohlgen/thai_word_similarity (official)
