DPRK-BERT: The Supreme Language Model
Arda Akdemir, Yeojoo Jeon

TL;DR
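DPRK-BERT is the first deep language model for the DPRK language, built by compiling a new unlabeled corpus and fine-tuning an existing ROK language model. It improves on existing approaches on two DPRK datasets, and a cross-lingual version generalizes better across the two Korean languages.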
Contribution
This paper introduces the first deep language model for the DPRK language by compiling a dedicated unlabeled corpus and adapting an existing ROK language model, enabling NLP research on the DPRK language.
Findings
Significant improvements over existing approaches on two DPRK datasets
Better generalization across the two Korean languages with a cross-lingual version of the model
Provision of NLP tools for DPRK language research
Abstract
Deep language models have achieved remarkable success in the NLP domain. The standard way to train a deep language model is to employ unsupervised learning from scratch on a large unlabeled corpus. However, such large corpora are only available for widely adopted, high-resource languages and domains. This study presents the first deep language model, DPRK-BERT, for the DPRK language. We achieve this by compiling the first unlabeled corpus for the DPRK language and fine-tuning a preexisting ROK language model. We compare the proposed model with existing approaches and show significant improvements on two DPRK datasets. We also present a cross-lingual version of this model which yields better generalization across the two Korean languages. Finally, we provide various NLP tools related to the DPRK language to foster future research.
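The abstract does not spell out the fine-tuning objective. As a rough illustration only, the sketch below assumes the standard BERT masked-language-model setup (select about 15% of tokens; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged), which is one common way to adapt an existing model to a new unlabeled corpus. The function name `mask_tokens`, the toy vocabulary, and the placeholder tokens are hypothetical and are not taken from the paper.

```python
import random

MASK_TOKEN = "[MASK]"  # mask symbol used by BERT-style tokenizers


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style masking to a token list (illustrative sketch).

    Returns (inputs, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # Position selected: the model must predict the original token.
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random vocabulary token
            else:
                inputs.append(tok)                # 10%: keep the token as-is
        else:
            # Position not selected: input unchanged, no prediction target.
            labels.append(None)
            inputs.append(tok)
    return inputs, labels


if __name__ == "__main__":
    # Placeholder tokens standing in for a tokenized sentence from the corpus.
    sentence = ["tok1", "tok2", "tok3", "tok4", "tok5", "tok6"]
    toy_vocab = sentence + ["tok7", "tok8"]
    masked, targets = mask_tokens(sentence, toy_vocab, seed=0)
    print(masked)
    print(targets)
```

In a real pipeline the masked inputs and labels would be fed to the pretrained model's masked-language-model head; the sketch covers only the data-preparation step.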
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
