Sinhala-English Parallel Word Dictionary Dataset

Kasun Wickramasinghe; Nisansa de Silva

arXiv:2308.02234·cs.CL·September 26, 2023

TL;DR

This paper introduces three open-source English-Sinhala parallel word dictionaries to support multilingual NLP tasks for this low-resource language, addressing the lack of such datasets.

Contribution

The work provides the first free, open, and publicly available parallel English-Sinhala word dictionaries, along with creation methodology and quality verification.

Findings

01 The datasets facilitate multilingual NLP for Sinhala-English tasks.

02 Experimental results confirm the datasets' quality and usability.

03 The resources are publicly accessible for research use.

Abstract

Parallel datasets are vital for performing and evaluating any kind of multilingual task. However, in the cases where one of the considered language pairs is a low-resource language, the existing top-down parallel data such as corpora are lacking in both tally and quality due to the dearth of human annotation. Therefore, for low-resource languages, it is more feasible to move in the bottom-up direction where finer granular pairs such as dictionary datasets are developed first. They may then be used for mid-level tasks such as supervised multilingual word embedding alignment. These in turn can later guide higher-level tasks in the order of aligning sentence or paragraph text corpora used for Machine Translation (MT). Even though more approachable than generating and aligning a massive corpus for a low-resource language, for the same reason of apathy from larger research entities, even…
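
As an illustration of the mid-level task the abstract mentions, supervised word embedding alignment with a seed dictionary is commonly set up as an orthogonal Procrustes problem. The sketch below is a minimal NumPy example of that general technique, not necessarily the authors' method. The function name procrustes_align and the random stand-in vectors are assumptions made for illustration; a real setup would fill the rows of X and Y with pretrained Sinhala and English embeddings for the word pairs listed in a dictionary.

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal matrix W minimising ||X @ W - Y||_F.

    X, Y: (n_pairs, dim) arrays; row i of X and row i of Y hold the
    embeddings of the two words in dictionary pair i.
    """
    # Closed-form solution: SVD of the cross-covariance matrix X^T Y
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Hypothetical stand-in data: random vectors instead of real embeddings
rng = np.random.default_rng(0)
dim, n_pairs = 5, 50
X = rng.normal(size=(n_pairs, dim))                     # "source-language" vectors
Q_true, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # hidden orthogonal map
Y = X @ Q_true                                          # "target-language" vectors

W = procrustes_align(X, Y)
residual = np.linalg.norm(X @ W - Y)  # near zero on this noise-free toy data
```

Constraining W to be orthogonal preserves distances and angles within the source space, which is why this formulation is a common starting point when only a small seed dictionary is available.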

Code & Models

Repositories

kasunw22/sinhala-para-dict (pytorch, Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Network Packet Processing and Optimization