Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean

Chanwoo Park; Suyoung Park; JiA Kang; Jongyeon Park; Sangho Kim; Hyunji M. Park; Sumin Bae; Mingyu Kang; Jaejin Lee

arXiv:2510.24150 · cs.CL · October 29, 2025

TL;DR

Ko-MuSR is a Korean benchmark for multistep soft reasoning over long narratives. Evaluations on it indicate that reasoning ability in large language models generalizes across languages, and that prompting which combines few-shot examples, reasoning traces, and task-specific hints improves accuracy.

Contribution

It introduces Ko-MuSR, the first benchmark to comprehensively evaluate multistep soft reasoning over long Korean narratives, and shows that prompting strategies improve model performance on it.

Findings

01 The two multilingual models evaluated outperform the two Korean-specialized models on Korean reasoning tasks.

02 Prompting that combines few-shot examples, reasoning traces, and task-specific hints raises accuracy to near human-level performance.

03 Reasoning ability in large language models appears to generalize across languages.

Abstract

We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models -- two multilingual and two Korean-specialized -- show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting…
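
The prompting setup described in the abstract (few-shot examples, reasoning traces, and task-specific hints over multiple-choice questions) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the `Example` fields, the prompt layout, the `build_prompt` and `accuracy` helpers, and the English labels "Reasoning:" and "Answer:" are all assumptions made for the sketch, and the paper's actual templates may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """Hypothetical container for one multiple-choice item (not the paper's schema)."""
    narrative: str
    question: str
    choices: List[str]
    answer_index: int                 # 0-based index of the correct choice
    reasoning: Optional[str] = None   # worked reasoning trace, used only in few-shot demos


def format_question(ex: Example) -> str:
    """Render the narrative, the question, and the numbered answer choices."""
    lines = [ex.narrative, "", ex.question]
    lines += [f"{i + 1}. {choice}" for i, choice in enumerate(ex.choices)]
    return "\n".join(lines)


def build_prompt(target: Example, shots: List[Example], hint: str = "") -> str:
    """Assemble: optional task hint, then few-shot demos, then the target question."""
    parts = []
    if hint:
        parts.append(hint)
    for shot in shots:
        demo = format_question(shot)
        if shot.reasoning:
            demo += "\nReasoning: " + shot.reasoning
        demo += f"\nAnswer: {shot.answer_index + 1}"
        parts.append(demo)
    # Leave the reasoning open so the model writes its own trace before answering.
    parts.append(format_question(target) + "\nReasoning:")
    return "\n\n".join(parts)


def accuracy(predicted: List[int], examples: List[Example]) -> float:
    """Fraction of items whose predicted 0-based choice index matches the gold answer."""
    if not examples:
        return 0.0
    correct = sum(p == ex.answer_index for p, ex in zip(predicted, examples))
    return correct / len(examples)
```

Under these assumptions, the hint, the demonstrations, and the reasoning traces can each be switched on or off independently (empty `hint`, empty `shots`, or `reasoning=None`), which is one way to compare how much each component contributes to accuracy.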
