Learning How to Translate North Korean through South Korean
Hwichan Kim, Sangwhan Moon, Naoaki Okazaki, and Mamoru Komachi

TL;DR
This paper develops a method to improve North Korean translation by creating a bilingual corpus from comparable data, enabling neural machine translation models to better handle North Korean inputs without extensive annotated data.
Contribution
The study introduces a novel approach to generate North Korean bilingual data from comparable corpora, enhancing NMT performance for North Korean translation without human annotation.
Findings
Training on the automatically created North Korean bilingual data significantly improves North Korean translation accuracy.
Automatic alignment methods suitable for North Korean are identified using manually created evaluation data.
A model trained on this data, built without human annotation, outperforms existing South Korean NMT models applied to North Korean inputs in zero-shot settings.
Abstract
South and North Korea both use the Korean language. However, Korean NLP research has focused on South Korean only, and existing NLP systems for the Korean language, such as neural machine translation (NMT) models, cannot properly handle North Korean inputs. Training a model using North Korean data is the most straightforward approach to solving this problem, but there is insufficient data to train NMT models. In this study, we create data for North Korean NMT models using a comparable corpus. First, we manually create evaluation data for automatic alignment and machine translation. Then, we investigate automatic alignment methods suitable for North Korean. Finally, we verify that a model trained on North Korean bilingual data without human annotation can significantly boost North Korean translation accuracy compared to existing South Korean models in zero-shot settings.
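The alignment step described in the abstract, turning a comparable corpus into sentence pairs, can be illustrated with a minimal sketch. This is not the authors' method: the greedy one-to-one matching, the `align_sentences` and `toy_score` names, the threshold value, and the tiny `TOY_LEXICON` are all assumptions made for illustration. A real pipeline would plug a proper cross-lingual similarity function into the `score` parameter.

```python
from typing import Callable, List, Tuple


def align_sentences(
    src: List[str],
    tgt: List[str],
    score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Tuple[int, int, float]]:
    """Greedy one-to-one alignment: keep the best-scoring pairs above threshold."""
    # Score every candidate pair drawn from the two comparable documents.
    candidates = [
        (score(s, t), i, j)
        for i, s in enumerate(src)
        for j, t in enumerate(tgt)
    ]
    # Visit pairs from highest to lowest score (ties broken by index order).
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    used_src, used_tgt, pairs = set(), set(), []
    for sim, i, j in candidates:
        if sim < threshold:
            break  # every remaining candidate is below the cut-off
        if i in used_src or j in used_tgt:
            continue  # each sentence may be aligned at most once
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((i, j, sim))
    return sorted(pairs)


# Hypothetical toy lexicon, for illustration only (not from the paper).
TOY_LEXICON = {"학교": "school", "물": "water", "책": "book"}


def toy_score(src_sentence: str, tgt_sentence: str) -> float:
    """Fraction of source tokens whose toy translation appears in the target."""
    src_tokens = src_sentence.split()
    tgt_tokens = set(tgt_sentence.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if TOY_LEXICON.get(tok) in tgt_tokens)
    return hits / len(src_tokens)


if __name__ == "__main__":
    korean = ["학교 책", "물"]
    english = ["water", "school book", "unrelated text"]
    for i, j, sim in align_sentences(korean, english, toy_score):
        print(korean[i], "<->", english[j], sim)
```

Sentences that never reach the threshold are simply left unaligned, which reflects the fact that a comparable corpus, unlike a parallel one, is not guaranteed to contain a translation for every sentence.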
Taxonomy
Topics: Korean Peninsula Historical and Political Studies · Natural Language Processing Techniques · Innovation in Digital Healthcare Systems
