# SART - Similarity, Analogies, and Relatedness for Tatar Language: New Benchmark Datasets for Word Embeddings Evaluation

**Authors:** Albina Khusainova, Adil Khan, and Adín Ramírez Rivera

arXiv: 1904.00365 · 2019-04-02

## TL;DR

This paper introduces three benchmark datasets for evaluating word embeddings in Tatar, an under-resourced Turkic language. The datasets cover word similarity, relatedness, and analogies, and are used to compare embedding models on Tatar against the same models on English.

## Contribution

The paper presents three Tatar datasets (Similarity, Relatedness, and Analogies) that follow the structure of existing English datasets but are not mere translations: they account for specifics of the Tatar language and expand beyond the originals.

## Key findings

- State-of-the-art models perform worse on Tatar datasets compared to English.
- The datasets reveal language-specific challenges in semantic modeling.
- Evaluation highlights the need for tailored embeddings for underrepresented languages.

## Abstract

There is a huge imbalance between languages currently spoken and corresponding resources to study them. Most of the attention naturally goes to the "big" languages: those which have the largest presence in terms of media and number of speakers. Other less represented languages sometimes do not even have a good quality corpus to study them. In this paper, we tackle this imbalance by presenting a new set of evaluation resources for Tatar, a language of the Turkic language family which is mainly spoken in Tatarstan Republic, Russia. We present three datasets: Similarity and Relatedness datasets that consist of human-scored word pairs and can be used to evaluate semantic models; and an Analogies dataset that comprises analogy questions and allows exploring semantic, syntactic, and morphological aspects of language modeling. All three datasets build upon existing datasets for the English language and follow the same structure. However, they are not mere translations. They take into account specifics of the Tatar language and expand beyond the original datasets. We evaluate state-of-the-art word embedding models for two languages using our proposed datasets for Tatar and the original datasets for English and report our findings on performance comparison.
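To make the evaluation setup in the abstract more concrete, the sketch below shows one common way such datasets are used. It is an illustration under stated assumptions, not the authors' code: the abstract does not name the metrics, so Spearman rank correlation (for the scored word pairs) and the vector-offset method (for analogy questions) are assumed here because they are conventional for this kind of benchmark. The vectors, words, and scores in `toy_vectors` and `toy_pairs` are made-up placeholders, not entries from the SART datasets.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate_similarity(embeddings, scored_pairs):
    """Correlate model cosine similarities with human scores, skipping OOV pairs."""
    model_scores, human_scores = [], []
    for word1, word2, score in scored_pairs:
        if word1 in embeddings and word2 in embeddings:
            model_scores.append(cosine(embeddings[word1], embeddings[word2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores)


def solve_analogy(embeddings, a, b, c):
    """Answer 'a is to b as c is to ?' via the vector offset b - a + c."""
    target = [y - x + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Hypothetical toy data (assumption: placeholders, not taken from the paper).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}
toy_pairs = [
    ("king", "queen", 8.0),
    ("man", "woman", 7.0),
    ("king", "apple", 2.0),
    ("king", "unknown_word", 5.0),  # out of vocabulary, so it is skipped
]

print("Spearman correlation:", evaluate_similarity(toy_vectors, toy_pairs))
print("man : king :: woman :", solve_analogy(toy_vectors, "man", "king", "woman"))
```

In practice the embeddings would come from a trained model and the pairs and questions from the dataset files. An analogy benchmark then typically reports the fraction of questions for which the predicted word matches the expected answer.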

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.00365/full.md

## References

14 references — full list in the complete paper: https://tomesphere.com/paper/1904.00365/full.md

---
Source: https://tomesphere.com/paper/1904.00365