KR-BERT: A Small-Scale Korean-Specific Language Model
Sangah Lee, Hansol Jang, Yunmee Baik, Suzi Park, Hyopil Shin

TL;DR
KR-BERT is a compact Korean-specific language model that, despite being trained on a smaller dataset, performs comparably to or better than larger models by capturing language-specific features.
Contribution
The paper introduces a Korean-specific BERT model with a tailored vocabulary and tokenizer, achieving high performance with significantly less data and resources.
Findings
KR-BERT performs comparably to or better than larger models.
Customized tokenization improves language representation.
A small dataset suffices for effective Korean language modeling.
Abstract
Since the appearance of BERT, recent works including XLNet and RoBERTa utilize sentence embedding models pre-trained on large corpora with a large number of parameters. Because such models require large hardware and a huge amount of data, they take a long time to pre-train. Therefore it is important to attempt to make smaller models that perform comparably. In this paper, we trained a Korean-specific model, KR-BERT, utilizing a smaller vocabulary and dataset. Since Korean is a morphologically rich, low-resource language written in a non-Latin alphabet, it is also important to capture language-specific linguistic phenomena that the Multilingual BERT model missed. We tested several tokenizers, including our BidirectionalWordPiece Tokenizer, and adjusted the minimal span of tokens for tokenization, ranging from the sub-character level to the character level, to construct a better vocabulary…
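To make the distinction between character-level and sub-character-level token spans concrete, here is a minimal sketch. It assumes that "character" means a precomposed Hangul syllable and "sub-character" means the conjoining jamo produced by Unicode NFD normalization; the abstract as excerpted here does not say which decomposition scheme the authors used, and the helper names `to_characters` and `to_sub_characters` are hypothetical.

```python
import unicodedata
from typing import List


def to_characters(text: str) -> List[str]:
    """Character-level units: one item per precomposed Hangul syllable (NFC)."""
    return list(unicodedata.normalize("NFC", text))


def to_sub_characters(text: str) -> List[str]:
    """Sub-character units: each syllable split into conjoining jamo (NFD).

    Assumption: NFD is used here only as a convenient stand-in for
    sub-character decomposition; it is not confirmed as the paper's method.
    """
    return list(unicodedata.normalize("NFD", text))


word = "한국어"  # "Korean language"
chars = to_characters(word)
jamo = to_sub_characters(word)

print(len(chars), chars)  # three syllable blocks
print(len(jamo), jamo)    # eight jamo: 3 + 3 + 2
```

A vocabulary built over jamo starts from a much smaller symbol inventory than one built over syllable blocks, at the cost of longer token sequences; that trade-off is what varying the minimal span explores.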
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · WordPiece · Byte Pair Encoding · Weight Decay · Attention Dropout · BERT · Dense Connections · Linear Warmup With Linear Decay · Residual Connection
