OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases
Yongrui Chen, Zhiqiang Liu, Jing Yu, Lin Ren, Nan Hu, Xinbang Dai, Jiajun Liu, Jiazhen Kang, Shenyu Zhang, Xinda Wang, Keyan Ding, Pengfei Shen, Haolei Zhu, Hongjie Deng, Yisong Wang, Tongtong Wu, Sheng Bi, Wen Zhang, Tianxing Wu, Qiu Ji, Haofen Wang, Wenliang Chen, Huajun Chen

TL;DR
OneEval is a comprehensive benchmark designed to evaluate large language models' reasoning abilities across diverse structured knowledge modalities and domains, revealing persistent limitations and guiding future improvements.
Contribution
The paper introduces OneEval, a new benchmark with 4,019 instances across multiple knowledge modalities and domains, to systematically assess and analyze LLMs' structured reasoning capabilities.
Findings
LLMs show limited performance on structured reasoning tasks, with the best models achieving only 32.2% accuracy on the hardest subset.
Accuracy declines as the structural complexity of the knowledge base increases, from 53% on textual reasoning to 25% on formal logic.
Extended reasoning chains yield diminishing returns, indicating the need for models to adapt reasoning depth to task complexity.
Abstract
Large Language Models (LLMs) have demonstrated substantial progress on reasoning tasks involving unstructured text, yet their capabilities deteriorate significantly when reasoning requires integrating structured external knowledge such as knowledge graphs, code snippets, or formal logic. This limitation is partly due to the absence of benchmarks capable of systematically evaluating LLM performance across diverse structured knowledge modalities. To address this gap, we introduce OneEval, a comprehensive benchmark explicitly designed to assess the knowledge-intensive reasoning capabilities of LLMs across four knowledge modalities (unstructured text, knowledge graphs, code, and formal logic) and five critical domains (general knowledge, government, science, law, and programming). OneEval comprises 4,019 carefully curated instances and includes a…
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Rough Sets and Fuzzy Logic
