KMMLU: Measuring Massive Multitask Language Understanding in Korean

Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, Stella Biderman

arXiv:2402.11548 · cs.CL · June 7, 2024 · 1 citation

TL;DR

KMMLU is a Korean benchmark of 35,030 expert-level multiple-choice questions drawn from original Korean exams. The best public LLM tested scores 50.5%, and even proprietary models such as GPT-4 and HyperCLOVA X do not exceed 60%.

Contribution

This paper introduces KMMLU, a Korean benchmark built from original Korean exam questions rather than translations of English benchmarks, and evaluates 27 public and proprietary LLMs on it, showing substantial room for improvement in Korean language modeling.

Findings

01

The best public LLM tested scores 50.5% on KMMLU.

02

Even the most capable proprietary models, such as GPT-4 and HyperCLOVA X, do not exceed 60%.

03

LLMs tailored to Korean, such as Polyglot-Ko, perform far worse than the best public model, which was trained primarily for English and Chinese.

Abstract

We propose KMMLU, a new Korean benchmark with 35,030 expert-level multiple-choice questions across 45 subjects ranging from humanities to STEM. While prior Korean benchmarks are translated from existing English benchmarks, KMMLU is collected from original Korean exams, capturing linguistic and cultural aspects of the Korean language. We test 27 public and proprietary LLMs and observe the best public model to score 50.5%, leaving significant room for improvement. This model was primarily trained for English and Chinese, not Korean. Current LLMs tailored to Korean, such as Polyglot-Ko, perform far worse. Surprisingly, even the most capable proprietary LLMs, e.g., GPT-4 and HyperCLOVA X, do not exceed 60%. This suggests that further work is needed to improve LLMs for Korean, and we believe KMMLU offers the appropriate tool to track this progress. We make our dataset publicly available on…
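To make the reported percentages concrete, here is a minimal sketch of how accuracy on a multiple-choice benchmark of this kind might be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `accuracy_by_subject`, the record layout, and the toy subjects and answers are all hypothetical. The abstract does not say whether its headline figures average over questions (micro) or over subjects (macro), so the sketch returns both.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Score multiple-choice predictions.

    `records` is an iterable of (subject, gold_choice, predicted_choice)
    tuples. This layout is an assumption made for illustration only.
    Returns (per_subject_accuracy, micro_average, macro_average).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, gold, predicted in records:
        total[subject] += 1
        correct[subject] += int(gold == predicted)
    if not total:
        raise ValueError("no records to score")

    per_subject = {s: correct[s] / total[s] for s in total}
    # Micro average: every question counts equally.
    micro = sum(correct.values()) / sum(total.values())
    # Macro average: every subject counts equally, regardless of size.
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro


# Invented toy data; not taken from KMMLU.
toy_records = [
    ("law", "A", "A"),
    ("law", "B", "C"),
    ("chemistry", "D", "D"),
    ("chemistry", "A", "A"),
    ("chemistry", "C", "B"),
    ("chemistry", "B", "B"),
]
per_subject, micro, macro = accuracy_by_subject(toy_records)
print(per_subject, round(micro, 4), round(macro, 4))
```

With subjects of unequal size the two averages can differ, so a score on a 45-subject benchmark is easiest to compare across reports when the averaging scheme is stated.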

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Linear Layer · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Multi-Head Attention · Layer Normalization · Dropout · Residual Connection