An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks

Wazir Ali; Saifullah Tumrani; Jay Kumar; Tariq Rahim Soomro

arXiv:2408.15720·cs.CL·August 29, 2024

TL;DR

This paper introduces a new Sindhi corpus for training word embeddings and evaluates several embedding algorithms, reporting that CBOW and skip-gram outperform GloVe and the existing Sindhi fastText embeddings on semantic analogy and downstream tasks.

Contribution

The paper presents a Sindhi corpus of more than 61 million words and an intrinsic and extrinsic evaluation of embedding algorithms, highlighting the effectiveness of CBOW and skip-gram for Sindhi language tasks.

Findings

01

CBOW and skip-gram outperform GloVe and the existing Sindhi fastText embeddings on both intrinsic and extrinsic evaluations

02

New Sindhi corpus with over 61 million words

03

Effective preprocessing pipeline developed for Sindhi text

Abstract

In this paper, we propose a new corpus for training word embeddings, consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline to filter unwanted text from the crawled data. Afterwards, the cleaned vocabulary is fed to the state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. To evaluate the pretrained embeddings, we use popular intrinsic and extrinsic evaluation approaches. The evaluation results reveal that continuous-bag-of-words and skip-gram perform better than GloVe and the existing Sindhi fastText word embeddings on both intrinsic and extrinsic evaluation approaches.
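The abstract does not say how the semantic analogy (intrinsic) evaluation is scored. A minimal sketch of one common approach is given here, assuming the vector-offset method: for a question "a is to b as c is to ?", pick the word whose vector is closest in cosine similarity to b - a + c. The function names, the English placeholder words, and the three-dimensional toy vectors are all hypothetical illustrations, not data or code from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(embeddings, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset method.

    The three question words are excluded from the candidates,
    as is conventional for this kind of evaluation.
    """
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


def analogy_accuracy(embeddings, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(
        1
        for a, b, c, expected in questions
        if solve_analogy(embeddings, a, b, c) == expected
    )
    return correct / len(questions)


# Hypothetical toy vectors (English placeholders, NOT from the paper's corpus).
TOY_EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "child": [0.5, 0.5, 0.5],
}

# Hypothetical analogy questions in (a, b, c, expected) form.
QUESTIONS = [
    ("man", "king", "woman", "queen"),
    ("king", "man", "queen", "woman"),
]

if __name__ == "__main__":
    print(solve_analogy(TOY_EMBEDDINGS, "man", "king", "woman"))
    print(analogy_accuracy(TOY_EMBEDDINGS, QUESTIONS))
```

Under this assumption, comparing embedding algorithms amounts to calling `analogy_accuracy` once per trained model on the same question set and comparing the resulting scores.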

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: fastText · GloVe Embeddings