HJ-Ky-0.1: an Evaluation Dataset for Kyrgyz Word Embeddings

Anton Alekseev; Gulnara Kabaeva

arXiv:2411.10724·cs.CL·December 2, 2024

TL;DR

This paper introduces the first 'silver standard' dataset for evaluating Kyrgyz word embeddings, a resource for checking how well word vectors reflect expert-assessed word similarity.

Contribution

The paper presents a 'silver standard' word similarity dataset for Kyrgyz, trains corresponding word embedding models, and validates the dataset's suitability through quality evaluation metrics.

Findings

01

The dataset supports similarity-based quality assessment of Kyrgyz word embeddings.

02

Kyrgyz word embedding models were trained and scored against the dataset to validate its suitability for evaluation.

03

As the first dataset of its kind for Kyrgyz, it fills a gap in Kyrgyz NLP resources.

Abstract

One of the key tasks in modern applied computational linguistics is constructing word vector representations (word embeddings), which are widely used to address natural language processing tasks such as sentiment analysis, information extraction, and more. To choose an appropriate method for generating these word embeddings, quality assessment techniques are often necessary. A standard approach involves calculating distances between vectors for words with expert-assessed 'similarity'. This work introduces the first 'silver standard' dataset for such tasks in the Kyrgyz language, alongside training corresponding models and validating the dataset's suitability through quality evaluation metrics.
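The evaluation approach described in the abstract can be illustrated with a minimal sketch. It assumes the dataset is a list of (word, word, human score) triples, that vector closeness is measured with cosine similarity, and that agreement with human scores is measured with Spearman rank correlation. These choices are common for word similarity benchmarks but are assumptions here, not details confirmed by the abstract. The words, vectors, and scores are toy placeholders, and `evaluate` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

def evaluate(embeddings, pairs):
    """Return (Spearman rho, number of pairs skipped as out-of-vocabulary)."""
    model_scores, human_scores, skipped = [], [], 0
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
        else:
            skipped += 1
    return spearman(model_scores, human_scores), skipped

# Toy placeholder data (assumption): not taken from HJ-Ky-0.1.
embeddings = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "car": [0.0, 1.0],
    "truck": [0.2, 0.8],
}
pairs = [
    ("cat", "kitten", 9.0),
    ("car", "truck", 8.5),
    ("cat", "car", 1.0),
    ("kitten", "truck", 2.0),
    ("cat", "unicorn", 5.0),  # "unicorn" has no vector, so this pair is skipped
]

rho, skipped = evaluate(embeddings, pairs)
print(f"Spearman rho = {rho:.3f}, skipped pairs = {skipped}")
```

A higher correlation means the embedding model orders word pairs more like the human annotators did. Reporting the number of skipped pairs alongside the correlation matters because a model with poor vocabulary coverage is otherwise scored on only part of the dataset.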
