An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks
Wazir Ali, Saifullah Tumrani, Jay Kumar, Tariq Rahim Soomro

TL;DR
This paper introduces a new Sindhi word embedding corpus and evaluates various embedding algorithms, demonstrating that CBOW and skip-gram outperform GloVe and fastText in semantic analogy and downstream tasks.
Contribution
It presents a large Sindhi corpus and a comprehensive evaluation of embedding algorithms, highlighting the effectiveness of CBOW and skip-gram for Sindhi language tasks.
Findings
CBOW and skip-gram outperform GloVe and fastText in evaluations
New Sindhi corpus with over 61 million words
Effective preprocessing pipeline developed for Sindhi text
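The summary does not describe the filtering steps themselves, so the following is only a minimal sketch of what such a cleaning stage might look like. It assumes Sindhi text in Perso-Arabic script and keeps only characters from the Unicode Arabic block; the function names `clean_line` and `tokenize` and the specific rules (dropping URLs, digits, Latin text, and punctuation) are illustrative assumptions, not the authors' pipeline.

```python
import re

# Assumption: the corpus is Sindhi in Perso-Arabic script, so only code points
# in the Unicode Arabic block (U+0600 to U+06FF) and whitespace are kept.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ARABIC_RE = re.compile(r"[^\u0600-\u06FF\s]")
# Digits and common punctuation that live inside the Arabic block.
ARABIC_NOISE_RE = re.compile(r"[\u0660-\u0669\u06F0-\u06F9\u060C\u061B\u061F\u06D4]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_line(line: str) -> str:
    """Remove URLs, non-Arabic-script characters, digits, and punctuation."""
    line = URL_RE.sub(" ", line)
    line = NON_ARABIC_RE.sub(" ", line)
    line = ARABIC_NOISE_RE.sub(" ", line)
    return WHITESPACE_RE.sub(" ", line).strip()


def tokenize(line: str) -> list:
    """Split a cleaned line on whitespace; returns [] for empty lines."""
    cleaned = clean_line(line)
    return cleaned.split() if cleaned else []
```

A real pipeline would likely also handle script normalization and duplicate removal; those steps are omitted here because the summary gives no detail on them.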
Abstract
In this paper, we propose a new corpus for training word embeddings, consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline for the filtration of unwanted text from crawled data. Afterwards, the cleaned vocabulary is fed to state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. For the evaluation of pretrained embeddings, we use popular intrinsic and extrinsic evaluation approaches. The evaluation results reveal that continuous-bag-of-words and skip-gram perform better than GloVe and existing Sindhi fastText word embeddings on both intrinsic and extrinsic evaluation approaches.
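Intrinsic evaluation on semantic analogies is commonly done with the vector-offset method: for a question "a is to b as c is to ?", pick the vocabulary word whose vector is closest in cosine similarity to b - a + c. The sketch below illustrates that general technique on toy data; the exact protocol used in the paper is not stated in this summary, and the names `solve_analogy`, `analogy_accuracy`, and the `toy` vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(emb, a, b, c):
    """Return the word closest to emb[b] - emb[a] + emb[c], excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Fraction of (a, b, c, d) questions answered correctly.

    Questions containing out-of-vocabulary words are skipped.
    """
    covered = [q for q in questions if all(w in emb for w in q)]
    if not covered:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in covered)
    return correct / len(covered)


# Toy 3-dimensional vectors, invented for illustration only.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.0, 0.2],
}
```

With trained embeddings, `emb` would map each vocabulary word to its learned vector, and the same accuracy function could be run once per model (CBOW, skip-gram, GloVe, fastText) to compare them on a shared question set.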
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: fastText · GloVe Embeddings
