ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

Dongwon Noh; Donghyeok Koh; Junghun Yuk; Gyuwan Kim; Jaeyong Lee; Kyungtae Lim; Cheoneum Park

arXiv:2505.16566·cs.CL·October 17, 2025

TL;DR

ScholarBench is a challenging bilingual benchmark designed to evaluate large language models' abilities in academic reasoning, comprehension, and abstraction across multiple research domains and complex problem types.

Contribution

The paper introduces a scalable, domain-specific, bilingual benchmark with high-quality, expert-aligned questions for assessing academic reasoning in LLMs.

Findings

01. State-of-the-art models score below 0.55, indicating high difficulty.

02. The benchmark covers five problem types across eight research domains.

03. It includes over 10,000 bilingual examples in English and Korean.

Abstract

Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. ScholarBench targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, ScholarBench evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions that are aligned with the characteristic research methodologies and discourse…

Code & Models

Datasets

KISTI-KONI/ScholarBench
dataset · 72 downloads

Videos

ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts · underline