Word Embedding based New Corpus for Low-resourced Language: Sindhi

Wazir Ali; Jay Kumar; Junyu Lu; Zenglin Xu

arXiv:1911.12579 · cs.CL · January 1, 2021 · 6 cites

TL;DR

This paper develops a large Sindhi corpus and trains high-quality word embeddings using state-of-the-art algorithms, addressing the resource scarcity for Sindhi NLP applications.

Contribution

It introduces a sizable Sindhi corpus and trains multiple word embedding models, providing a valuable resource for low-resourced language NLP development.

Findings

01. Sindhi word embeddings outperform SdfastText in intrinsic evaluations.

02. The corpus contains over 61 million words from web sources.

03. Preprocessing pipeline effectively cleans noisy web data.

Abstract

Representing words and phrases as dense vectors of real numbers that encode semantic and syntactic properties is a vital constituent of natural language processing (NLP). The success of neural network (NN) models in NLP largely relies on such dense word representations learned from large unlabeled corpora. Sindhi, a morphologically rich language spoken by a large population in Pakistan and India, lacks corpora, which play an essential role as a test-bed for generating word embeddings and developing language-independent NLP systems. In this paper, a large corpus of more than 61 million words is developed for the low-resourced Sindhi language for training neural word embeddings. The corpus is acquired from multiple web resources using web-scrappy. Due to the unavailability of open-source preprocessing tools for Sindhi, the preprocessing of such a large corpus becomes a challenging…
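The abstract mentions cleaning a scraped web corpus before training embeddings. As a rough illustration of what such a step can look like, the sketch below strips URLs, markup, Latin text, and digits, keeping only tokens written in the Unicode Arabic block (U+0600 to U+06FF), which contains the Perso-Arabic letters used for Sindhi. This is not the authors' pipeline: the function names `clean_line` and `clean_corpus`, the regular expressions, and the choice of filtering by Unicode block are all assumptions made for this example.

```python
import re

# Hypothetical filters for this sketch; the paper's actual rules are not shown here.
URL = re.compile(r"https?://\S+")
# Anything outside the Unicode Arabic block (markup, Latin text, ASCII digits, spaces).
OUTSIDE_ARABIC_BLOCK = re.compile(r"[^\u0600-\u06FF]+")
# Punctuation marks and digits that sit inside the Arabic block.
ARABIC_PUNCT_DIGITS = re.compile(
    r"[\u060C\u061B\u061F\u06D4\u0660-\u0669\u06F0-\u06F9]+"
)


def clean_line(line):
    """Return the Arabic-script tokens of one raw line of scraped text."""
    line = URL.sub(" ", line)
    line = OUTSIDE_ARABIC_BLOCK.sub(" ", line)
    line = ARABIC_PUNCT_DIGITS.sub(" ", line)
    return line.split()


def clean_corpus(lines):
    """Yield one token list per line, skipping lines that end up empty."""
    for line in lines:
        tokens = clean_line(line)
        if tokens:
            yield tokens
```

The token lists produced by `clean_corpus` are in the one-sentence-per-list shape that common embedding trainers accept as input. A real pipeline for Sindhi would likely also need normalization and stop-word handling, which this sketch leaves out.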

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: fastText · GloVe Embeddings