KoBALT: Korean Benchmark For Advanced Linguistic Tasks

Hyopil Shin; Sangah Lee; Dongjun Jang; Wooseok Song; Jaeyoon Kim; Chaeyoung Oh; Hyemi Jo; Youngchae Ahn; Sihyun Oh; Hyohyeong Chang; Sunkyoung Kim; Jinsik Lee

arXiv:2505.16125·cs.CL·May 23, 2025

TL;DR

KoBALT is a comprehensive, linguistically motivated Korean benchmark of 700 multiple-choice questions across five linguistic domains, designed to evaluate how well large language models understand Korean language phenomena.

Contribution

The paper introduces an expert-curated benchmark whose questions have minimal n-gram overlap with standard Korean corpora, addressing the limited linguistic depth and typological grounding of existing Korean benchmarks.

Findings

01. Top model achieved 61% accuracy overall.

02. Performance varied significantly across linguistic domains.

03. Strong correlation between KoBALT scores and human judgments.

Abstract

We introduce KoBALT (Korean Benchmark for Advanced Linguistic Tasks), a comprehensive linguistically-motivated benchmark comprising 700 multiple-choice questions spanning 24 phenomena across five linguistic domains: syntax, semantics, pragmatics, phonetics/phonology, and morphology. KoBALT is designed to advance the evaluation of large language models (LLMs) in Korean, a morphologically rich language, by addressing the limitations of conventional benchmarks that often lack linguistic depth and typological grounding. It introduces a suite of expert-curated, linguistically motivated questions with minimal n-gram overlap with standard Korean corpora, substantially mitigating the risk of data contamination and allowing a more robust assessment of true language understanding. Our evaluation of 20 contemporary LLMs reveals significant performance disparities, with the highest-performing model…
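The abstract's contamination argument rests on n-gram overlap between benchmark questions and existing corpora. The following is a minimal sketch of one way such an overlap score could be computed, not the authors' procedure: the abstract does not say whether the n-grams are word-level or character-level, or which value of n was used. Character 5-grams and the helper names `char_ngrams` and `ngram_overlap` are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after collapsing runs of whitespace."""
    normalized = " ".join(text.split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def ngram_overlap(question: str, corpus: list[str], n: int = 5) -> float:
    """Fraction of the question's n-grams (with multiplicity) found in the corpus.

    Character-level n-grams and n=5 are illustrative assumptions,
    not settings reported in the abstract.
    """
    question_grams = char_ngrams(question, n)
    total = sum(question_grams.values())
    if total == 0:
        # Question is shorter than n characters: no n-grams to compare.
        return 0.0

    # Collect every distinct n-gram that occurs anywhere in the corpus.
    corpus_grams = set()
    for document in corpus:
        corpus_grams.update(char_ngrams(document, n))

    hits = sum(count for gram, count in question_grams.items() if gram in corpus_grams)
    return hits / total
```

A score near 0 means the question shares almost no surface strings with the reference corpus, while a score near 1 means it could largely be reconstructed from corpus text. For real corpora the set of corpus n-grams would likely need a more memory-efficient structure than an in-memory `set`.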

Code & Models

Datasets

snunlp/KoBALT-700 (dataset, 263 downloads)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification