Reduce Indonesian Vocabularies with an Indonesian Sub-word Separator

Mukhlis Amien; Feng Chong; Huang Heyan

arXiv:2207.00552·cs.CL·July 4, 2022

Open Access

TL;DR

This paper introduces a rule-based sub-word segmentation method for Indonesian in neural machine translation, significantly reducing vocabulary size and improving translation quality without requiring corpus data.

Contribution

It presents a novel rule-based approach for Indonesian sub-word segmentation that enhances NMT performance and reduces vocabulary size without relying on corpus-based methods.

Findings

01. Vocabulary reduced by up to 57%
02. Translation quality improved by up to 5 BLEU points
03. Method is practical and corpus-independent

Abstract

Indonesian is an agglutinative language, since it forms words through a compounding process. A translation model for this language therefore needs a mechanism that operates below the word level, referred to as the sub-word level. The compounding process leads to a rare-word problem because the vocabulary size explodes. We propose a strategy to address this rare-word problem in neural machine translation (NMT) systems that use Indonesian as one language of the pair. Our approach uses a rule-based method to split a word into its root and accompanying affixes while retaining its meaning and context. A rule-based algorithm has a further advantage: it requires no corpus data and only applies the standard rules of Indonesian. Our experiments confirm that this method is practical. It reduces the vocabulary size by up to 57%, and on the English to Indonesian…
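To make the idea of splitting a word into a root plus marked affixes concrete, here is a minimal Python sketch. It is not the paper's rule set: the affix lists, the `@@` separator marker, the `MIN_ROOT` guard, and the function name `segment` are all assumptions chosen for illustration, and the sketch ignores Indonesian morphophonemic changes (for example, "menulis" is formed from the root "tulis").

```python
# Illustrative affix lists only; these are NOT the rules used in the paper.
PREFIXES = ("meng", "mem", "men", "ber", "ter", "di", "ke", "se")
SUFFIXES = ("kan", "nya", "lah", "an", "i")
MIN_ROOT = 3  # hypothetical guard against stripping very short words


def segment(word, sep="@@"):
    """Split a word into optional prefix, root, and optional suffix.

    Affix pieces carry the separator marker so the original word can be
    rebuilt by joining the pieces and deleting the marker.
    """
    word = word.lower()
    before, after = [], []
    # Strip at most one prefix; longer prefixes are listed first.
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= MIN_ROOT:
            before.append(p + sep)
            word = word[len(p):]
            break
    # Strip at most one suffix from what remains.
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= MIN_ROOT:
            after.append(sep + s)
            word = word[:-len(s)]
            break
    return before + [word] + after


if __name__ == "__main__":
    words = ["membaca", "dibaca", "membacakan", "bacaan"]
    pieces = {piece for w in words for piece in segment(w)}
    # Four surface forms share one root, so the piece inventory stays small.
    print(sorted(pieces))
```

In this toy example the four surface forms all reduce to the shared root "baca" plus a handful of affix tokens, which is the mechanism by which sub-word segmentation can shrink a vocabulary. A naive stripper like this one also over-segments genuine roots that happen to end in an affix-like string, so a practical system needs a fuller set of rules.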

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling