KLUE: Korean Language Understanding Evaluation
Sungjoon Park, Jihyung Moon, Sungdong Kim, Won Ik Cho, Jiyoon Han, Jangwon Park, Chisung Song, Junseong Kim, Yongsook Song, Taehwan Oh, Joohong Lee, Juhyun Oh, Sungwon Lyu, Younghoon Jeong, Inkwon Lee, Sangwoo Seo, Dongjun Lee, Hyunwoo Kim, Myeonghwa Lee, Seongbo Jang

TL;DR
KLUE is a comprehensive Korean language understanding benchmark with diverse tasks, datasets, and pretrained models, designed to advance Korean NLP research and facilitate future multilingual benchmarks.
Contribution
This paper introduces KLUE, a new Korean NLU benchmark with multiple tasks, datasets, evaluation metrics, pretrained models, and insights from initial experiments.
Findings
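KLUE-RoBERTa-large outperforms the other baseline models evaluated on the benchmark.
Removing personally identifiable information (PII) from the pretraining corpus causes only minimal performance degradation.
Combining BPE with morpheme-level pre-tokenization is effective.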
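To make the last finding concrete, here is a minimal toy sketch of BPE learned on top of morpheme-level pre-tokenization. It is not the KLUE tokenizer: the function names (`merge_pair`, `learn_bpe`, `encode`, `tokenize`) are illustrative, and the morpheme segmentation is hard-coded by hand, whereas a real pipeline would presumably obtain it from a morphological analyzer. The point it illustrates is that merges are learned only inside morphemes, so no subword token can span a morpheme boundary.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(units, num_merges):
    """Learn up to `num_merges` BPE merges from pre-tokenized units (here: morphemes)."""
    # Every unit starts as a tuple of single characters.
    vocab = Counter(tuple(u) for u in units)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by unit frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge inside any unit
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(unit, merges):
    """Apply the learned merges, in order, to a single pre-tokenized unit."""
    symbols = tuple(unit)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def tokenize(morphemes, merges):
    """Tokenize a word given as a list of morphemes; tokens never cross morphemes."""
    return [tok for m in morphemes for tok in encode(m, merges)]


if __name__ == "__main__":
    # Hand-segmented toy corpus (assumption): noun + particle for each word.
    segmented = [["학교", "에"], ["학교", "는"], ["친구", "는"]]
    units = [m for word in segmented for m in word]
    merges = learn_bpe(units, num_merges=10)
    print(merges)
    print(tokenize(["학교", "는"], merges))
```

Running the same `learn_bpe` on unsegmented words (for example `["학교에", "학교는", "친구는"]`) allows merges that fuse a noun with its particle, which is the behavior morpheme-level pre-tokenization is meant to prevent.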
Abstract
We introduce Korean Language Understanding Evaluation (KLUE) benchmark. KLUE is a collection of 8 Korean natural language understanding (NLU) tasks, including Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking. We build all of the tasks from scratch from diverse source corpora while respecting copyrights, to ensure accessibility for anyone without any restrictions. With ethical considerations in mind, we carefully design annotation protocols. Along with the benchmark tasks and data, we provide suitable evaluation metrics and fine-tuning recipes for pretrained language models for each task. We furthermore release the pretrained language models (PLM), KLUE-BERT and KLUE-RoBERTa, to help reproducing baseline models on KLUE and thereby…
Code & Models
- 🤗 Norod78/hebrew-bad_wiki-gpt_neo-tiny · model · 510 dl
- 🤗 SEBIS/code_trans_t5_small_program_synthese_transfer_learning_finetune · model · 41 dl · ♡ 5
- 🤗 geralt/MechDistilGPT2 · model · 22 dl
- 🤗 klue/bert-base · model · 57k dl · ♡ 62
- 🤗 klue/roberta-base · model · 110k dl · ♡ 46
- 🤗 klue/roberta-large · model · 15k dl · ♡ 62
- 🤗 klue/roberta-small · model · 1.6k dl · ♡ 14
- 🤗 typeform/distilbert-base-uncased-mnli · model · 184k dl · ♡ 44
- 🤗 paust/pko-t5-small · model · 33 dl · ♡ 7
- 🤗 paust/pko-t5-base · model · 326 dl · ♡ 21
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Byte Pair Encoding
