Exploiting Parallel Corpora to Improve Multilingual Embedding based Document and Sentence Alignment

Dilan Sachintha; Lakmali Piyarathna; Charith Rajitha; Surangika Ranathunga

arXiv:2106.06766·cs.CL·June 15, 2021·1 cites

Open Access

TL;DR

This paper introduces a weighting mechanism leveraging small-scale parallel corpora to enhance multilingual sentence representations for document and sentence alignment, especially benefiting low-resource languages like Sinhala and Tamil.

Contribution

It proposes a novel weighting mechanism that improves alignment performance by utilizing available parallel corpora for low-resource languages.

Findings

01 Significant improvement in alignment accuracy for Sinhala and Tamil.

02 Effective use of small-scale parallel data enhances multilingual representations.

03 Public release of dataset and source code facilitates further research.

Abstract

Multilingual sentence representations are a great advantage for low-resource languages that lack enough data to build monolingual models of their own. A few studies have exploited these representations separately for document alignment and for sentence alignment. However, most low-resource languages are under-represented in these pre-trained models. Thus, for low-resource languages, these models have to be fine-tuned for the task at hand using additional data sources. This paper presents a weighting mechanism that uses available small-scale parallel corpora to improve the performance of multilingual sentence representations on document and sentence alignment. Experiments are conducted on two low-resource languages, Sinhala and Tamil. Results on a newly created dataset of Sinhala-English, Tamil-English, and Sinhala-Tamil…
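The abstract does not spell out the weighting mechanism, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes sentence embeddings are already available as plain vectors and that a small bilingual lexicon has been extracted from a parallel corpus. The names `lexicon_overlap`, `weighted_score`, `align`, the blending parameter `alpha`, and the toy data are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def lexicon_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with a lexicon translation found in the target."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_set)
    return hits / len(src_tokens)


def weighted_score(src, tgt, lexicon, alpha=0.5):
    """Hypothetical blend: embedding similarity plus alpha times lexicon overlap."""
    overlap = lexicon_overlap(src["tokens"], tgt["tokens"], lexicon)
    return cosine(src["vec"], tgt["vec"]) + alpha * overlap


def align(sources, targets, lexicon, alpha=0.5):
    """For each source sentence, return the index of the best-scoring target."""
    return [
        max(range(len(targets)),
            key=lambda j: weighted_score(s, targets[j], lexicon, alpha))
        for s in sources
    ]


# Toy data (made up): the embeddings alone slightly prefer the wrong target.
lexicon = {"src_house": {"house"}}  # assumed to come from a small parallel corpus
sources = [{"vec": [1.0, 0.0], "tokens": ["src_house"]}]
targets = [
    {"vec": [1.0, 0.1], "tokens": ["dog"]},
    {"vec": [1.0, 0.2], "tokens": ["house"]},
]

embedding_only = align(sources, targets, lexicon, alpha=0.0)
with_lexicon = align(sources, targets, lexicon, alpha=0.5)
```

In this toy case the embedding-only ranking picks target 0, while adding the lexicon term moves the choice to target 1. How the paper actually derives and applies its weights may differ.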

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications