Evaluating the Logical Reasoning Abilities of Large Reasoning Models

Hanmeng Liu; Yiran Ding; Zhizhang Fu; Chaoli Zhang; Xiaozhang Liu; Yue Zhang

arXiv:2505.11854·cs.AI·May 20, 2025

TL;DR

This paper introduces LogiEval, a comprehensive benchmark for assessing logical reasoning in large models, revealing their strengths and limitations across diverse reasoning types and highlighting persistent fundamental reasoning challenges.

Contribution

The paper presents LogiEval, a new holistic benchmark for logical reasoning, including LogiEval-Hard, a challenging subset to diagnose reasoning bottlenecks in large language models.

Findings

01

Models excel at argument analysis and analogical reasoning.

02

Models show uneven performance across reasoning types.

03

LogiEval-Hard reveals persistent reasoning failures.

Abstract

Large reasoning models, often post-trained on long chain-of-thought (long CoT) data with reinforcement learning, achieve state-of-the-art performance on mathematical, coding, and domain-specific reasoning benchmarks. However, their logical reasoning capabilities - fundamental to human cognition and independent of domain knowledge - remain understudied. To address this gap, we introduce LogiEval, a holistic benchmark for evaluating logical reasoning in large reasoning models. LogiEval spans diverse reasoning types (deductive, inductive, analogical, and abductive) and task formats (e.g., logical sequence, argument analysis), sourced from high-quality human examinations (e.g., LSAT, GMAT). Our experiments demonstrate that modern reasoning models excel at 4-choice argument analysis problems and analogical reasoning, surpassing human performance, yet exhibit uneven capabilities across…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

This paper is well-organized and easy to follow.

Weaknesses

1. **Lack of Novelty**. There are already many benchmarks for evaluating logical reasoning, e.g. [1][2], and even a multimodal one [3].

[1] JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models https://arxiv.org/abs/2510.18855
[2] LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models https://arxiv.org/abs/2408.15778
[3] MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs https://arxiv.org/abs/2505.21327

I believe

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- Clear motivation addressing the gap between domain-specific benchmarks and fundamental reasoning evaluation.
- The paper is well-organized and clearly written.

Weaknesses

- Insufficient human performance analysis.
- Limited case study. It is important to evaluate the benchmark via case studies.
- What is the major difference between this work and previous logical reasoning benchmarks, such as GLoRE?

Reviewer 03 · Rating 4 · Confidence 2

Strengths

1. Unifies four reasoning types and ten task formats in a single, exam-sourced benchmark.
2. Preserves the original language (English/Chinese), maintaining linguistic nuance and avoiding translation bias.
3. Evaluates 7 top 2025 LLMs with consistent prompting and answer extraction, and includes human performance baselines and statistical significance testing.

Weaknesses

1. Most items are multiple-choice; open-ended or proof-based reasoning is underrepresented.
2. Human accuracy is derived from historical exam pass rates, which may not reflect controlled, per-item performance.
3. Does not assess reasoning validity or explanation quality, only final-answer correctness.
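
To make the last weakness concrete, here is a minimal sketch of what final-answer-only scoring of multiple-choice items can look like. It is not the paper's evaluation code: the regular expression, the function names `extract_choice` and `accuracy`, and the sample responses are all assumptions made for illustration.

```python
import re

# Hypothetical pattern (an assumption, not taken from the paper): accepts
# phrasings such as "Answer: B" or "the answer is (C)" for options A-D.
ANSWER_RE = re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?([A-D])\b")


def extract_choice(response):
    """Return the last option letter the response commits to, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy(responses, gold):
    """Fraction of items whose extracted letter equals the gold label."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Illustrative, made-up model outputs and gold labels.
responses = [
    "All cats are birds, so the answer is (C).",  # flawed reasoning, right letter
    "Answer: A. On reflection, Answer: B",        # the last stated letter is used
    "I am not sure.",                             # nothing to extract
]
gold = ["C", "B", "D"]
score = accuracy(responses, gold)
```

In this sketch the first response is counted as correct even though its reasoning is invalid, which is the gap the reviewer points at: a metric of this kind cannot distinguish a sound derivation from a lucky or poorly justified letter.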

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Machine Learning in Healthcare