DPRK-BERT: The Supreme Language Model

Arda Akdemir; Yeojoo Jeon

arXiv:2112.00567·cs.CL·December 2, 2021

TL;DR

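DPRK-BERT is the first deep language model for the DPRK language, built by compiling the first unlabeled DPRK corpus and fine-tuning a preexisting ROK language model. It improves significantly over existing approaches on two DPRK datasets, and a cross-lingual version generalizes better across the two Korean languages.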

Contribution

This paper introduces the first language model for the DPRK language by compiling a dedicated unlabeled corpus and adapting a preexisting ROK language model, enabling NLP research on the DPRK language.

Findings

01 Significant performance improvements over existing approaches on two DPRK datasets

02 Better generalization across the two Korean languages (DPRK and ROK) with a cross-lingual version of the model

03 NLP tools for the DPRK language, provided to foster future research

Abstract

Deep language models have achieved remarkable success in the NLP domain. The standard way to train a deep language model is to employ unsupervised learning from scratch on a large unlabeled corpus. However, such large corpora are only available for widely adopted, high-resource languages and domains. This study presents DPRK-BERT, the first deep language model for the DPRK language. We achieve this by compiling the first unlabeled corpus for the DPRK language and fine-tuning a preexisting ROK language model. We compare the proposed model with existing approaches and show significant improvements on two DPRK datasets. We also present a cross-lingual version of this model, which yields better generalization across the two Korean languages. Finally, we provide various NLP tools related to the DPRK language to foster future research.
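The abstract describes adapting a preexisting ROK model with an unlabeled DPRK corpus but does not state the training objective. For BERT-style models, such adaptation is commonly done by continuing masked-language-model training, so the following is only a minimal sketch under that assumption, using the standard BERT masking recipe (select 15% of tokens; of those, 80% become a mask token, 10% a random token, 10% stay unchanged). The names `mask_tokens` and `MASK`, the rates, and the toy tokens are illustrative and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol; a real tokenizer defines its own


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for masked-language-model training.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    The 15% / 80-10-10 rates are the usual BERT defaults (an assumption here).
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # replace with mask symbol
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # replace with random token
            else:
                corrupted.append(tok)                # keep the original token
        else:
            labels.append(None)  # position is not scored
            corrupted.append(tok)
    return corrupted, labels


# Toy usage with placeholder tokens (not real corpus data).
toy_vocab = ["t1", "t2", "t3", "t4", "t5"]
toy_sentence = ["t1", "t3", "t2", "t5", "t4", "t1"]
corrupted, labels = mask_tokens(toy_sentence, toy_vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

In a fine-tuning setup of this kind, the loss would be computed only at positions where the label is not `None`, which is why unlabeled text alone is enough to adapt the model to a new language variety.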

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods