Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark
Chanjun Park, Hyeonwoo Kim, Dahyun Kim, Seonghwan Cho, Sanghoon Kim, Sukyung Lee, Yungi Kim, Hwalsuk Lee

TL;DR
This paper presents the Open Ko-LLM Leaderboard and Ko-H5 Benchmark for evaluating Korean language models, emphasizing the importance of private test sets and comprehensive evaluation for linguistic diversity.
Contribution
It introduces a new evaluation framework and benchmark for Korean LLMs, including private test sets and analysis methods, to improve model assessment.
Findings
Private test sets enhance evaluation robustness
Correlation study within the Ko-H5 benchmark relates its component scores
Temporal analysis reveals trends in Korean LLM development
Abstract
This paper introduces the Open Ko-LLM Leaderboard and the Ko-H5 Benchmark as vital tools for evaluating Large Language Models (LLMs) in Korean. Incorporating private test sets while mirroring the English Open LLM Leaderboard, we establish a robust evaluation framework that has been well integrated in the Korean LLM community. We perform data leakage analysis that shows the benefit of private test sets along with a correlation study within the Ko-H5 benchmark and temporal analyses of the Ko-H5 score. Moreover, we present empirical support for the need to expand beyond set benchmarks. We hope the Open Ko-LLM Leaderboard sets precedent for expanding LLM evaluation to foster more linguistic diversity.
Taxonomy
Topics: Computational and Text Analysis Methods · Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
