LogicReward: Incentivizing LLM Reasoning via Step-Wise Logical Supervision
Jundong Xu, Hao Fei, Huichi Zhou, Xin Quan, Qijun Huang, Shengqiong Wu, William Yang Wang, Mong-Li Lee, Wynne Hsu

TL;DR
LogicReward introduces a step-wise logical supervision framework using theorem proving to enhance reasoning accuracy and faithfulness in large language models, outperforming existing models on reasoning tasks.
Contribution
The paper presents a novel reward system, LogicReward, that enforces logical correctness at each reasoning step using theorem proving, along with Autoformalization with Soft Unification to improve formalization.
Findings
Model trained with LogicReward surpasses GPT-4o and o4-mini on reasoning tasks.
LogicReward improves reasoning faithfulness and generalizability.
Provides reliable reward signals without ground-truth labels.
Abstract
Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6\% and 2\% on natural language inference and logical reasoning tasks with simple training…
Peer Reviews
Decision·ICLR 2026 Poster
- This paper tackles a good problem in LLM research, and the contribution is timely and generally principled. - LogicReward achieves strong empirical performance. I appreciate that the authors evaluate the models/methods on a pretty large set of logical reasoning benchmarks. - The study includes fair ablation studies and comparisons (e.g., Table 2 and Figure 4), corroborating the advantage of LogicReward. - The generalizability analysis is interesting and encouraging (Sec. 5.3): LogicReward does
- The related work section (Sec. 2) is incomplete. Much prior work has done on LLM logical reasoning using a range of different techniques, from earlier work like [1] to neurosymbolic approaches like [2, 3]. I think a dedicated paragraph is needed (could be in between "general reasoning" and "reward models") given the topic of this paper. - While LogicReward seems a good fit for logical reasoning tasks, many kinds of reasoning are harder to (auto)formalize. Do the authors expect their method to
1. The use of symbolic methods for process supervision is well motivated and directly addresses limitations of outcome rewards. By enforcing step level logical validity, the approach reduces reward hacking and aligns training with faithful reasoning. 2. Incorporating theorem prover feedback in multiple iterations expands the exploration space and yields a more informative training signal. Iterative prover guided refinement improves coverage of candidate solutions, lowers variance, and encourages
1. The paper omits several studies that combine LLMs with provers, for example Quan et al., “Verification and Refinement of Natural Language Explanations through LLM and Symbolic Theorem Proving,” EMNLP 2024. The task focus is closely related. Please situate LogicReward within this line of work and add direct comparisons where possible. 2. It is unclear how much the method’s gains depend on the chosen prover and logic. If a different proof assistant or formal language is used, does LogicReward r
1. The primary strength is the integration of a symbolic theorem prover into the reward loop. This moves beyond the limitations of purely outcome-based or probabilistic process-based rewards, providing a more rigorous "formal and symbolic assessment of logical validity". 2. The method directly targets and improves reasoning faithfulness, a critical need for high-stakes domains like medicine and law. The paper's manual evaluation found that the LogicReward-trained model could produce more faithfu
1. The authors explicitly acknowledge that the LogicReward labeling process, which involves running a theorem prover, is "relatively slower than approaches that only compare final outcomes". 2. This speed limitation presents a significant challenge for scalability, particularly for "on-policy reinforcement learning scenarios" (like PPO or GRPO) that require fast, in-the-loop reward generation. The paper's use of SFT and DPO cleverly sidesteps this by being offline methods, but it leaves the sy
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Advanced Graph Neural Networks · Mathematics, Computing, and Information Processing
