Evaluating the Logical Reasoning Abilities of Large Reasoning Models
Hanmeng Liu, Yiran Ding, Zhizhang Fu, Chaoli Zhang, Xiaozhang Liu, Yue Zhang

TL;DR
This paper introduces LogiEval, a comprehensive benchmark for assessing logical reasoning in large models, revealing their strengths and limitations across diverse reasoning types and highlighting persistent fundamental reasoning challenges.
Contribution
The paper presents LogiEval, a new holistic benchmark for logical reasoning, including LogiEval-Hard, a challenging subset to diagnose reasoning bottlenecks in large language models.
Findings
Models excel at argument analysis and analogical reasoning.
Models show uneven performance across reasoning types.
LogiEval-Hard reveals persistent reasoning failures.
Abstract
Large reasoning models, often post-trained on long chain-of-thought (long CoT) data with reinforcement learning, achieve state-of-the-art performance on mathematical, coding, and domain-specific reasoning benchmarks. However, their logical reasoning capabilities - fundamental to human cognition and independent of domain knowledge - remain understudied. To address this gap, we introduce LogiEval, a holistic benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis), sourced from high-quality human examinations (e.g., LSAT, GMAT). Our experiments demonstrate that modern reasoning models excel at 4-choice argument analysis problems and analogical reasoning, surpassing human performance, yet exhibit uneven capabilities across…
Peer Reviews
Decision · Submitted to ICLR 2026
This paper is well-organized and easy to follow.
1. **Lack of Novelty**. There are already many benchmarks for evaluating logical reasoning, e.g. [1][2], and even a multimodal one [3].
[1] JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models https://arxiv.org/abs/2510.18855
[2] LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models https://arxiv.org/abs/2408.15778
[3] MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs https://arxiv.org/abs/2505.21327
I believe
- Clear motivation addressing the gap between domain-specific benchmarks and fundamental reasoning evaluation.
- The paper is well-organized and clearly written.
- Insufficient human performance analysis.
- Limited case study. It is important to evaluate the benchmark via case studies.
- What is the major difference between this work and previous logical evaluation benchmarks, such as GLoRE?
1. Unifies four reasoning types and ten task formats in a single, exam-sourced benchmark.
2. Preserves the original language (English/Chinese), maintaining linguistic nuance and avoiding translation bias.
3. Evaluates 7 top 2025 LLMs with consistent prompting and answer extraction, and includes human performance baselines and statistical significance testing.
1. Most items are multiple-choice; open-ended or proof-based reasoning is underrepresented.
2. Human accuracy is derived from historical exam pass rates, which may not reflect controlled, per-item performance.
3. Does not assess reasoning validity or explanation quality, only final answer correctness.
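To make the "final answer correctness" point concrete, here is a minimal sketch of how multiple-choice scoring with answer extraction is commonly done. It is not the paper's pipeline: the extraction pattern, the function names `extract_choice` and `accuracy`, and the sample responses are all hypothetical, and the paper's actual extraction rules are not given on this page.

```python
import re
from typing import List, Optional

# Hypothetical pattern: matches "answer is (B)", "Answer: C", "answer D", etc.
# Only the "answer is" prefix is case-insensitive; the option letter must be
# an uppercase A-D standing alone, so "answer is Definitely" does not match.
ANSWER_PATTERN = re.compile(r"(?i:answer\s*(?:is|:)?)\s*\(?([A-D])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter the response commits to, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter.

    Only the final answer is compared; the reasoning text is ignored,
    which is exactly the limitation the reviewer describes.
    """
    if not gold or len(responses) != len(gold):
        raise ValueError("responses and gold must be non-empty and equal in length")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Illustrative, made-up responses (not data from the paper).
sample_responses = [
    "Premise 2 rules out A, so the answer is (B).",
    "Answer: C",
    "I am not sure.",
]
sample_gold = ["B", "D", "A"]
sample_accuracy = accuracy(sample_responses, sample_gold)
```

A response with flawed reasoning that happens to land on the right letter is scored the same as a sound one, so a scorer of this kind cannot separate valid reasoning from lucky guesses.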
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning in Healthcare
