OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases

Yongrui Chen; Zhiqiang Liu; Jing Yu; Lin Ren; Nan Hu; Xinbang Dai; Jiajun Liu; Jiazhen Kang; Shenyu Zhang; Xinda Wang; Keyan Ding; Pengfei Shen; Haolei Zhu; Hongjie Deng; Yisong Wang; Tongtong Wu; Sheng Bi; Wen Zhang; Tianxing Wu; Qiu Ji; Haofen Wang; Wenliang Chen; Huajun Chen; Guilin Qi

arXiv:2506.12577 · cs.CL · June 17, 2025

Open Access

TL;DR

OneEval is a comprehensive benchmark designed to evaluate large language models' reasoning abilities across diverse structured knowledge modalities and domains, revealing persistent limitations and guiding future improvements.

Contribution

The paper introduces OneEval, a new benchmark with 4,019 instances across multiple knowledge modalities and domains, to systematically assess and analyze LLMs' structured reasoning capabilities.

Findings

01

LLMs show limited performance on structured reasoning tasks, with the best models achieving only 32.2% accuracy on the hardest subset.

02

Accuracy declines as the structural complexity of the knowledge base increases, from 53% on textual reasoning to 25% on formal logic.

03

Extended reasoning chains yield diminishing returns, indicating the need for models to adapt reasoning depth to task complexity.

Abstract

Large Language Models (LLMs) have demonstrated substantial progress on reasoning tasks involving unstructured text, yet their capabilities significantly deteriorate when reasoning requires integrating structured external knowledge such as knowledge graphs, code snippets, or formal logic. This limitation is partly due to the absence of benchmarks capable of systematically evaluating LLM performance across diverse structured knowledge modalities. To address this gap, we introduce OneEval, a comprehensive benchmark explicitly designed to assess the knowledge-intensive reasoning capabilities of LLMs across four structured knowledge modalities (unstructured text, knowledge graphs, code, and formal logic) and five critical domains (general knowledge, government, science, law, and programming). OneEval comprises 4,019 carefully curated instances and includes a…

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Rough Sets and Fuzzy Logic