Building the Language Resource for a Cebuano-Filipino Neural Machine Translation System

Kristine Mae Adlaon; Nelson Marcos

arXiv:2110.15716·cs.CL·November 1, 2021

TL;DR

This paper details the creation of a parallel Cebuano-Filipino corpus from biblical texts and Wikipedia, employing correction techniques and topic segmentation, to facilitate neural machine translation with promising BLEU score results.

Contribution

It introduces a novel parallel corpus for Cebuano-Filipino translation, combining correction methods and topic segmentation for low-resource language translation.

Findings

01 BLEU scores differ between the two corpora

02 Correction techniques improved translation quality

03 Topic segmentation aids in sentence extraction

Abstract

A parallel corpus is a critical resource in machine learning-based translation. Collecting, extracting, and aligning texts to build an acceptable corpus for translation is tedious, especially for low-resource languages. In this paper, we present the efforts made to build a parallel corpus for Cebuano and Filipino from two different domains: biblical texts and the web. For the biblical resource, subword unit translation for verbs and a copy-able approach for nouns were applied to correct inconsistencies in the translation. This correction mechanism was applied as a preprocessing technique. For Wikipedia, the main web resource, commonly occurring topic segments were extracted from both the source and the target languages. These observed topic segments are unique in 4 different categories. The identification of these topic…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification