Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study

Yujun Zhou; Jiayi Ye; Zipeng Ling; Yufei Han; Yue Huang; Haomin Zhuang; Zhenwen Liang; Kehan Guo; Taicheng Guo; Xiangqi Wang; Xiangliang Zhang

arXiv:2506.04810 · cs.CL · October 10, 2025

TL;DR

This paper introduces FineLogic, a detailed evaluation framework for logical reasoning in LLMs, revealing how different supervision styles influence reasoning quality and process, and providing insights for improving LLM reasoning capabilities.

Contribution

The paper proposes a novel fine-grained evaluation framework and analyzes the impact of various supervision formats on LLM reasoning abilities.

Findings

01

Natural language supervision improves out-of-distribution generalization.

02

Symbolic supervision enhances structural soundness of reasoning steps.

03

Fine-tuning mainly refines step-by-step reasoning rather than answer convergence.

Abstract

Logical reasoning is a core capability for large language models (LLMs), yet existing benchmarks that rely solely on final-answer accuracy fail to capture the quality of the reasoning process. To address this, we introduce FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall accuracy, stepwise soundness, and representation-level probing. Leveraging this framework, we conduct a comprehensive study on how different supervision formats in fine-tuning shape reasoning abilities. We fine-tune LLMs on four supervision styles: one in natural language and three symbolic variants. We find a key trade-off: natural language supervision excels at generalization to out-of-distribution and long-chain problems, whereas symbolic supervision is superior at instilling structurally sound, atomic reasoning steps. Furthermore, our probing analysis…
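The abstract's distinction between final-answer accuracy and step-level soundness can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Step` and `Trace` layouts, the `is_valid` flag (imagined as coming from a human or automated checker), and both function names are hypothetical and are not FineLogic's actual metric definitions.

```python
from dataclasses import dataclass


@dataclass
class Step:
    conclusion: str
    is_valid: bool  # hypothetical per-step validity label from some checker


@dataclass
class Trace:
    steps: list[Step]
    predicted: str  # model's final answer
    gold: str       # reference answer


def final_answer_accuracy(traces: list[Trace]) -> float:
    """Fraction of traces whose final answer matches the reference."""
    return sum(t.predicted == t.gold for t in traces) / len(traces)


def all_steps_valid_rate(traces: list[Trace]) -> float:
    """Fraction of traces in which every step is labeled valid."""
    return sum(all(s.is_valid for s in t.steps) for t in traces) / len(traces)


# Two toy traces: both reach the right answer, but one contains an invalid step.
traces = [
    Trace(
        steps=[Step("A implies B", True), Step("B holds", True)],
        predicted="True",
        gold="True",
    ),
    Trace(
        steps=[Step("A implies B", True), Step("C holds", False)],
        predicted="True",
        gold="True",
    ),
]

accuracy = final_answer_accuracy(traces)
soundness = all_steps_valid_rate(traces)
```

On this toy data the accuracy is 1.0 while the all-steps-valid rate is 0.5, which is the kind of gap a final-answer-only benchmark would not reveal.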

Code & Models

Repositories

yujunzhou/logical
PyTorch · Official

Videos

Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications