Introducing two Vietnamese Datasets for Evaluating Semantic Models of (Dis-)Similarity and Relatedness

Kim Anh Nguyen; Sabine Schulte im Walde; Ngoc Thang Vu

arXiv:1804.05388·cs.CL·April 20, 2018

TL;DR

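This paper introduces two datasets for Vietnamese, a low-resource language: ViCon, which contains synonym and antonym pairs across word classes, and ViSim-400, which provides human-rated degrees of similarity across five semantic relations. Both are verified with standard co-occurrence and neural network models.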

Contribution

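The paper provides two novel Vietnamese datasets for semantic similarity and relatedness: ViCon for distinguishing similarity from dissimilarity, and ViSim-400 for graded similarity judgments. Together they allow semantic models to be evaluated on a low-resource language.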

Findings

01

Model results on ViCon and ViSim-400 are comparable to those on the respective English datasets.

02

Both standard co-occurrence models and neural network models were used to verify the datasets.

03

The degrees of similarity in ViSim-400, spanning five semantic relations, were rated by human judges.

Abstract

We present two novel datasets for the low-resource language Vietnamese to assess models of semantic similarity: ViCon comprises pairs of synonyms and antonyms across word classes, thus offering data to distinguish between similarity and dissimilarity. ViSim-400 provides degrees of similarity across five semantic relations, as rated by human judges. The two datasets are verified through standard co-occurrence and neural network models, showing results comparable to the respective English datasets.
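As an illustration of how a rated dataset such as ViSim-400 could be used, the sketch below scores word pairs with cosine similarity and compares the model's ranking with human ratings using Spearman's rank correlation. The choice of metric is an assumption, since the abstract does not name one. The word identifiers, vectors, and ratings are invented placeholders, not entries from ViSim-400, and the rating scale is arbitrary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy embeddings (placeholders, not real Vietnamese word vectors).
vectors = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.0, 1.0],
    "w4": [0.5, 0.5],
}

# Hypothetical (word, word, human rating) triples on an arbitrary scale.
rated_pairs = [("w1", "w2", 5.5), ("w1", "w4", 3.0), ("w1", "w3", 0.5)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman rho between model and human scores: {rho:.3f}")
```

A rank correlation is a common choice for this kind of comparison because it depends only on the ordering of the pairs, so the model's scores and the human ratings do not need to share a scale.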
