Reduce Indonesian Vocabularies with an Indonesian Sub-word Separator
Mukhlis Amien, Feng Chong, Huang Heyan

TL;DR
This paper introduces a rule-based sub-word segmentation method for Indonesian in neural machine translation, significantly reducing vocabulary size and improving translation quality without requiring corpus data.
Contribution
It presents a novel rule-based approach for Indonesian sub-word segmentation that enhances NMT performance and reduces vocabulary size without relying on corpus-based methods.
Findings
Vocabulary reduced by up to 57%
Translation quality improved by up to 5 BLEU points
Method is practical and corpus-independent
Abstract
Indonesian is an agglutinative language: its word-formation process compounds roots and affixes. A translation model for this language therefore requires a mechanism that operates below the word level, referred to as the sub-word level. This compounding process leads to a rare word problem because the vocabulary size explodes. We propose a strategy to address the unique word problem of neural machine translation (NMT) systems in which Indonesian is one language of the pair. Our approach uses a rule-based method to transform a word into its root and accompanying affixes so that its meaning and context are retained. A rule-based algorithm has a further advantage: it requires no corpus data and applies only the standard rules of Indonesian. Our experiments confirm that this method is practical. It reduces the vocabulary size significantly, by up to 57%, and on the English to Indonesian…
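To make the idea of splitting a word into a root and its affixes concrete, the following is a minimal sketch of a rule-based separator. It is not the algorithm from the paper: the affix lists, the `@@` joining marker, the `min_root` threshold, and the function names `segment` and `vocab_sizes` are all assumptions chosen for illustration, and the sketch ignores real Indonesian morphophonemics such as nasal assimilation (for example, it would not recover "tulis" from "menulis").

```python
# Illustrative affix lists only; these are NOT the rules used in the paper.
# Longer affixes come first so that "mem" is tried before "me", etc.
PREFIXES = ("meng", "mem", "men", "me", "ber", "ter", "per", "di")
SUFFIXES = ("kan", "nya", "an", "i")


def segment(word, min_root=4):
    """Split a word into at most one prefix, a root, and at most one suffix.

    An affix is stripped only if at least `min_root` characters remain,
    a crude guard (assumed here) against splitting short roots like "makan".
    """
    root = word.lower()
    before, after = [], []
    for prefix in PREFIXES:
        if root.startswith(prefix) and len(root) - len(prefix) >= min_root:
            before.append(prefix + "@@")  # "@@" marks a piece joined to the right
            root = root[len(prefix):]
            break
    for suffix in SUFFIXES:
        if root.endswith(suffix) and len(root) - len(suffix) >= min_root:
            after.append("@@" + suffix)  # "@@" marks a piece joined to the left
            root = root[: -len(suffix)]
            break
    return before + [root] + after


def vocab_sizes(words):
    """Return (number of word types, number of sub-word types)."""
    word_types = set(words)
    subword_types = {piece for w in word_types for piece in segment(w)}
    return len(word_types), len(subword_types)


if __name__ == "__main__":
    sample = ["makan", "makanan", "dimakan", "membeli", "dibeli", "belikan",
              "dibelikan", "membelikan", "berjalan", "jalan", "jalanan",
              "dijalankan"]
    for w in sample:
        print(w, "->", " ".join(segment(w)))
    n_words, n_subwords = vocab_sizes(sample)
    print(f"word types: {n_words}, sub-word types: {n_subwords}")
```

On this toy word list the sub-word inventory is smaller than the word inventory because many surface forms share the same roots and affixes. The figures reported in the paper come from its own rule set and data, not from this sketch.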
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
