Sinhala-English Parallel Word Dictionary Dataset
Kasun Wickramasinghe, Nisansa de Silva

TL;DR
This paper introduces three open-source English-Sinhala parallel word dictionaries to support multilingual NLP tasks involving Sinhala, a low-resource language for which such datasets have been lacking.
Contribution
The work provides the first free, open, and publicly available parallel English-Sinhala word dictionaries, along with creation methodology and quality verification.
Findings
The datasets facilitate multilingual NLP for Sinhala-English tasks.
Experimental results confirm dataset quality and usability.
Resources are publicly accessible for research use.
Abstract
Parallel datasets are vital for performing and evaluating any kind of multilingual task. However, when one language of the considered pair is low-resource, existing top-down parallel data such as corpora are lacking in both quantity and quality due to the dearth of human annotation. Therefore, for low-resource languages, it is more feasible to move in the bottom-up direction, where finer-grained pairs such as dictionary datasets are developed first. They may then be used for mid-level tasks such as supervised multilingual word embedding alignment. These in turn can later guide higher-level tasks such as aligning the sentence or paragraph text corpora used for Machine Translation (MT). Even though more approachable than generating and aligning a massive corpus for a low-resource language, for the same reason of apathy from larger research entities, even…
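To make the "supervised multilingual word embedding alignment" step concrete, here is a minimal sketch of one common approach, orthogonal Procrustes alignment, in which dictionary pairs supply the supervision. This is an illustration under stated assumptions, not the authors' method: the excerpt does not say which alignment technique was used, and the vectors below are random stand-ins rather than real Sinhala or English embeddings. The names `procrustes_align`, `X`, `Y`, and `Q` are hypothetical.

```python
import numpy as np


def procrustes_align(src_vecs, tgt_vecs):
    """Return the orthogonal matrix W minimizing ||src_vecs @ W - tgt_vecs||_F.

    Row i of src_vecs and row i of tgt_vecs are assumed to be the embeddings
    of the two words in the i-th dictionary entry.
    """
    # The SVD of the cross-covariance matrix gives the closed-form solution.
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt


# Toy data (assumption): 50 dictionary pairs with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                  # stand-in source-language vectors
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # hidden rotation between spaces
Y = X @ Q                                     # stand-in target-language vectors

W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))
```

In this noise-free toy setting the learned map recovers the hidden rotation. With real embeddings the fit is only approximate, and its quality depends on the coverage and accuracy of the dictionary pairs used as supervision.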
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Network Packet Processing and Optimization
