Boosting Deductive Reasoning with Step Signals In RLHF
Jialian Li, Yipin Zhang, Wei Shen, Yuzi Yan, Jian Xie, Dong Yan

TL;DR
This paper introduces MuseD, an automated data generation method for multi-step deductive reasoning in LLMs, which improves logical reasoning capabilities through RLHF training and controlled difficulty levels.
Contribution
We developed MuseD, a novel automated data generation approach for multi-step reasoning, enabling effective training and evaluation of LLMs' deductive capabilities.
Findings
RLHF training with MuseD data improves logical reasoning in LLMs
MuseD enables control over instruction complexity for training
Models show enhanced multi-step reasoning abilities
Abstract
Logical reasoning is a crucial task for Large Language Models (LLMs), enabling them to tackle complex problems. Among reasoning tasks, multi-step reasoning poses a particular challenge. Grounded in the theory of formal logic, we have developed an automated method, Multi-step Deduction (MuseD), for generating deductive reasoning data. MuseD has allowed us to create training and testing datasets for multi-step reasoning. Our generation method enables control over the complexity of the generated instructions, facilitating training and evaluation of models across different difficulty levels. Through RLHF training, our training data has demonstrated significant improvements in logical capabilities for both in-domain and out-of-domain reasoning tasks. Additionally, we have conducted tests to assess the multi-step reasoning abilities of various models.
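The abstract describes two ideas that a small example may make more concrete: generating deduction problems whose difficulty is controlled, and scoring individual reasoning steps. The sketch below is illustrative only and is not the authors' implementation. This page does not show MuseD's actual logical forms, prompt templates, or reward design, so everything here is an assumption: the "All A are B" premise form, the placeholder term names `T0`, `T1`, ..., and the hypothetical helpers `make_problem`, `render`, `gold_steps`, and `score_steps`.

```python
import random
from typing import Dict, List, Tuple

Fact = Tuple[str, str]  # (A, B) stands for the premise "All A are B" (assumed form)


def make_problem(n_steps: int, n_distractors: int = 0, seed: int = 0) -> Dict:
    """Build a premise chain whose conclusion needs n_steps syllogistic steps."""
    rng = random.Random(seed)
    n_terms = n_steps + 2 + 2 * n_distractors
    terms = [f"T{i}" for i in range(n_terms)]  # hypothetical placeholder terms
    rng.shuffle(terms)
    chain = terms[: n_steps + 2]
    spare = terms[n_steps + 2:]
    premises = [(chain[i], chain[i + 1]) for i in range(n_steps + 1)]
    # Distractors use terms disjoint from the chain, so they never help the proof.
    premises += [(spare[2 * i], spare[2 * i + 1]) for i in range(n_distractors)]
    rng.shuffle(premises)
    return {"premises": premises, "chain": chain, "question": (chain[0], chain[-1])}


def render(problem: Dict) -> str:
    """Turn a problem into a plain-text prompt (assumed template)."""
    lines = [f"All {a} are {b}." for a, b in problem["premises"]]
    a, b = problem["question"]
    lines.append(f"Question: are all {a} {b}?")
    return "\n".join(lines)


def gold_steps(problem: Dict) -> List[Fact]:
    """Reference derivation: extend the conclusion one chain link at a time."""
    chain = problem["chain"]
    return [(chain[0], chain[i]) for i in range(2, len(chain))]


def score_steps(premises: List[Fact], steps: List[Fact]) -> List[int]:
    """Give 1 to each step that follows from two known facts, else 0."""
    known = set(premises)
    scores = []
    for a, c in steps:
        middles = {y for (x, y) in known if x == a}
        valid = any((m, c) in known for m in middles)
        scores.append(1 if valid else 0)
        if valid:
            known.add((a, c))  # later steps may build on valid earlier steps
    return scores


problem = make_problem(n_steps=3, n_distractors=2, seed=1)
print(render(problem))
print(score_steps(problem["premises"], gold_steps(problem)))
```

In this sketch, raising `n_steps` lengthens the required derivation and raising `n_distractors` adds irrelevant premises, which is one plausible way to vary difficulty. The per-step scores from `score_steps` show the kind of step-level signal that could accompany a final-answer reward; how the paper actually combines such signals during RLHF is not stated on this page.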
Peer Reviews
Decision·Submitted to ICLR 2025
* The paper starts with a good premise, of generating synthetic reasoning data for training LLMs. Methods to generate high-quality synthetic data at scale are generally becoming more popular in the field, and I predict their importance to keep rising.
* The paper explores post-training, which is less explored in the reasoning space than fine-tuning approaches.
* Results with Llama 8B seem to be mostly positive, and the authors compared to running RLHF on Ultrafeedback alone.
* The paper doesn't show a single example of the data the method is able to generate (not even in the appendix). The explanation (Section 4) is a bit hard to follow, with details all only given in text. It would perhaps be more productive to discuss concrete examples, even if the details of the algorithm are discussed at a higher level (these can likely be inferred from seeing a few representative prompts). If I missed this, I'd appreciate if the authors point me to where such examples are.
* Fr
1. Deductive reasoning, especially syllogistic reasoning, is foundational for tackling more complex tasks. Fine-tuning on the proposed dataset meaningfully improves the model's ability to apply correct syllogistic reasoning.
2. Using a step-based signal for reinforcement learning is a reasonable approach. For tasks where sequential steps are crucial, step-level feedback can help the model learn accurate reasoning pathways more effectively during the RL process.
3. The experiments are thorough,
1. While the use of step-level feedback or process-based rewards is intuitive, it is not novel and has been previously introduced by works such as [1] with subsequent advancements in [2, 3]. Automating label generation is crucial for training a reward model; however, since syllogistic reasoning is formal and symbolic, the potential step formats are highly constrained. Consequently, the step-level feedback here may be trivial, as identifying correct and relevant steps is straightforward.
2. The p
This work proposes a simple but effective method to enhance the logical reasoning ability of LLMs. The performance of LLM (Llama3-8B) improves significantly on several logical reasoning tasks.
The contributions and experiments of this work do not seem solid. Firstly, the methods and forms of the generated logical reasoning datasets seem overly simple, only reflecting multi-step features, and do not appear to be significantly different from previous works, like ProofWriter. Secondly, the PPO-based model is only compared with the original baseline LLM (LLaMA3), without comparisons to other baseline models. In fact, many fine-tuned smaller models have also achieved good p
Taxonomy
Topics: Logic, Reasoning, and Knowledge · AI-based Problem Solving and Planning · Bayesian Modeling and Causal Inference
