FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning
Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, Min Zhang

TL;DR
This paper introduces FAPO, a method that improves reasoning in language models by penalizing flawed reasoning patterns during reinforcement learning, leading to more reliable and stable performance without extra token costs.
Contribution
FAPO is a novel, parameter-free reward penalty method that detects and mitigates flawed reasoning in RL, enhancing model reliability and training stability.
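As a rough illustration of how a reward penalty on flawed-positive rollouts could interact with group-relative advantages, here is a minimal sketch. It is not the authors' implementation. The +1/-1 outcome reward, the `penalty` value of 1.0, the GRPO-style mean/std normalization, and all function names are assumptions made for this example. The flawed flags stand in for the output of a flaw detector.

```python
from statistics import mean, pstdev


def penalized_rewards(correct, flawed, penalty=1.0):
    """Outcome reward (+1 correct, -1 incorrect) with a penalty subtracted
    from rollouts that are correct but flagged as flawed.

    `penalty` is a hypothetical placeholder value, not taken from the paper.
    """
    rewards = []
    for is_correct, is_flawed in zip(correct, flawed):
        reward = 1.0 if is_correct else -1.0
        if is_correct and is_flawed:
            reward -= penalty  # a flawed positive earns less than a clean positive
        rewards.append(reward)
    return rewards


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt
    (assumed GRPO-style: subtract the group mean, divide by the group std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four rollouts: a clean positive, a flawed positive, two negatives.
correct = [True, True, False, False]
flawed = [False, True, False, False]

rewards = penalized_rewards(correct, flawed)
advantages = group_relative_advantages(rewards)
```

Under these assumptions the flawed-positive rollout still receives a higher advantage than the incorrect rollouts, but a lower one than the fully correct rollout, so the policy is pushed toward clean reasoning without treating a flawed-but-correct answer as a failure.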
Findings
FAPO improves correctness and reliability across multiple domains.
It stabilizes training without increasing token usage.
FAPO uses a generative reward model (GenRM) to detect flawed reasoning effectively.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models (LLMs). In this context, models explore reasoning trajectories and exploit rollouts with correct answers as positive signals for policy optimization. However, these rollouts might involve flawed patterns such as answer-guessing and jump-in-reasoning. Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns. In this work, we first conduct a systematic study of flawed-positive rollouts in RL and find that they enable rapid capability gains during the early optimization stage, while constraining reasoning capability later by reinforcing unreliable patterns. Building on these insights, we propose Flawed-Aware Policy Optimization (FAPO), which…
Peer Reviews
Decision: ICLR 2026 Poster
The problem tackled is of great significance to the LLM reasoning community. As RLVR becomes a dominant paradigm for advancing LLM capabilities, ensuring that the learned reasoning is not just correct but also reliable and transparent is crucial for safety and trustworthiness. FAPO provides a practical, efficient, and theoretically grounded solution that improves both the efficiency and the final quality of RL training. The release of code and benchmarks further enhances its impact. Experiments
While the results on mathematical reasoning and GPQA are strong, the paper's claims of broad domains would be more convincing with validation on a wider range of tasks. A key domain of interest is code generation, where verifiable rewards are common and flawed reasoning (e.g., code that passes specific tests but is buggy or inefficient) is a major concern. Demonstrating coding effectiveness would significantly broaden the method's impact and generalizability. The 20% training-time overhead, whi
1. The paper clearly identifies and analyzes a critical problem in RLVR. The preliminary study (Section 2.2) effectively demonstrates the "twofold effect" of flawed positives (acting as "stepping stones" early in training but hindering optimization later), providing a solid empirical foundation for the proposed solution.
2. The FAPO algorithm's adaptive reward penalty is simple yet theoretically grounded. The analysis in Appendix A demonstrates how this mechanism creates an automatic, parameter-f
1. The entire framework's effectiveness is contingent on the quality of the FAPO-Critic-85K dataset, which was labeled by a "teacher model" (Qwen3-32B). The GenRM can only learn to detect flaws that the teacher model can identify. This creates a fundamental performance ceiling; any subtle errors missed by the teacher will be propagated, and FAPO will fail to penalize them.
2. The GenRM is trained primarily on mathematical reasoning tasks. While it shows good performance on GPQA-Diamond, it is u
- **Well-motivated.** The paper clearly identifies *flawed-positive rollouts* as a pervasive yet unresolved failure mode in RLVR, illustrating their dynamics through quantitative trends (Figure 2) and motivating the need for process-level awareness beyond outcome-based rewards. This diagnosis provides a solid conceptual foundation for introducing FAPO.
- **Sound theoretical formulation.** The theoretical sections (Sec. 3.2, Appendix A) formalize the reward-penalization mechanism using group-rela
- **[Soundness]** While the observations in Section 3 are insightful, the claimed causal relationship between flawed positives and performance gains (lines 190–192) requires further evidence. The presented figure only demonstrates a correlation, not causation, and additional controlled ablations would be needed to validate this conclusion.
- **[Soundness]** The evaluation datasets are relatively small. AIME (30 samples) can be insufficient for robust reasoning assessment, and on the larger GPQA-
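
To put the reviewer's concern about a 30-problem benchmark in concrete terms, the sketch below computes the binomial standard error of an accuracy estimate. The 0.5 accuracy is a hypothetical value chosen for illustration, not a result from the paper. The calculation also assumes one independent attempt per problem, which is a simplification. Averaging over several samples per problem reduces sampling noise but does not enlarge the problem set.

```python
import math


def accuracy_standard_error(accuracy, n_problems):
    """Standard error of an accuracy estimate, treating each problem as an
    independent Bernoulli trial: sqrt(p * (1 - p) / n)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


# Hypothetical accuracy of 0.5 on a 30-problem benchmark.
se = accuracy_standard_error(0.5, 30)

# Approximate 95% interval half-width under a normal approximation.
half_width = 1.96 * se
```

With these illustrative numbers the standard error is about 0.09, so the approximate 95% interval spans roughly 18 percentage points either side of the measured accuracy. Differences of a few points between methods on such a set are therefore hard to separate from noise.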
