Preference Optimization for Reasoning with Pseudo Feedback
Fangkai Jiao, Geyang Guo, Xingxing Zhang, Nancy F. Chen, Shafiq Joty, Furu Wei

TL;DR
This paper introduces a method to generate pseudo feedback for reasoning tasks using test case evaluations, improving large language models' reasoning performance without relying on human-labeled datasets.
Contribution
It proposes a novel pseudo feedback generation approach based on test cases, enhancing preference optimization for reasoning tasks in LLMs.
Findings
Significant performance improvements on mathematical reasoning benchmarks.
Outperforms several existing models and approaches.
Effective pseudo feedback method applicable to multiple reasoning domains.
Abstract
Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reasoning problems as an evaluation against associated test cases. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multiple test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for…
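The self-consistency variant described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated, so the exact procedure is unknown. The sketch assumes that candidate solutions are executable, that the pseudo label for each test input is the majority output across candidates, and that a preference pair is formed whenever two candidates' pass rates differ by at least a threshold. The names `pseudo_expected_outputs`, `pass_rate`, `preference_pairs`, and the `margin` parameter are all hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an executable candidate solution.
Program = Callable[[int], int]


def pseudo_expected_outputs(programs: List[Program],
                            test_inputs: List[int]) -> Dict[int, int]:
    """Majority-vote the candidates' outputs on each test input (assumed labeling rule)."""
    expected: Dict[int, int] = {}
    for x in test_inputs:
        outputs = []
        for program in programs:
            try:
                outputs.append(program(x))
            except Exception:
                continue  # a crashing candidate casts no vote
        if outputs:
            expected[x] = Counter(outputs).most_common(1)[0][0]
    return expected


def pass_rate(program: Program, expected: Dict[int, int]) -> float:
    """Fraction of pseudo-labeled test cases that a candidate passes."""
    if not expected:
        return 0.0
    passed = 0
    for x, y in expected.items():
        try:
            passed += program(x) == y
        except Exception:
            pass  # a crash counts as a failed test case
    return passed / len(expected)


def preference_pairs(programs: List[Program],
                     test_inputs: List[int],
                     margin: float = 0.5) -> List[Tuple[int, int]]:
    """Return (chosen, rejected) index pairs; `margin` is an assumed threshold."""
    expected = pseudo_expected_outputs(programs, test_inputs)
    rates = [pass_rate(p, expected) for p in programs]
    indices = range(len(programs))
    return [(i, j) for i, j in product(indices, repeat=2)
            if rates[i] - rates[j] >= margin]


# Toy problem: "return the square of x". Candidate 2 is wrong but
# happens to agree with the others on x = 0 and x = 2.
candidates: List[Program] = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x * 2,
]
test_inputs = [0, 1, 2, 3]
pairs = preference_pairs(candidates, test_inputs)  # [(0, 2), (1, 2)]
```

Each resulting pair could then be passed to a DPO-style trainer as a chosen and a rejected response for the same prompt. The sketch also shows the limitation one reviewer raises: if most candidates are wrong in the same way, the majority output becomes an incorrect pseudo label.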
Peer Reviews
Decision: ICLR 2025 Spotlight
* Creates the pseudo feedback framework in a way that combines both math and code reasoning tasks.
* Because the pipeline for generating pseudo feedback is entirely automated, the process is scalable and less expensive than recreating such a process with humans would be.
* The experiments are robust, conducted with three open LLMs, and with multiple combinations of techniques across two domains.
* The results of the experiments are not only significant, but also well-explained. The authors do a
* The primary weaknesses I see with this approach are the limitations caused by self-consistency over self-generated pseudo feedback. I could see improvement over iterations being marginal for more challenging math/coding problems, as the self-consistent "answer" for them is likely incorrect.
* The paper presents a novel approach to preference optimization by leveraging pseudo feedback.
* The experiments are well-designed and comprehensive, covering both mathematical reasoning and coding tasks. The results show substantial improvements over baseline models and even surpass some state-of-the-art models, indicating the high quality of the proposed methods.
* The paper is well-structured and clearly written. The methodology is explained in detail, and the experimental setup and results
Please see the questions below.
1. Comprehensive experiments across a range of reasoning tasks and LLM models.
2. Clear presentation and good formalization.
3. Empirical results show the method’s effectiveness.
1. Limited Novelty at ICLR Level: The technique of leveraging unit tests as feedback for direct preference optimization (DPO/PPO) is already established in existing research. Previous works, such as CodeRL [1], and more recent studies [2][3], have incorporated unit tests and compiler feedback within reinforcement learning (RL)/PPO frameworks for code generation. Similarly, fine-grained feedback with DPO has been applied in mathematical reasoning [4], and LLM-based feedback with DPO has been
Taxonomy
Topics: Multi-Criteria Decision Making
Methods: Balanced Selection
