Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille

TL;DR
This paper introduces Generative Adversarial Reasoner, a framework that improves large language model reasoning by co-training a reasoner and discriminator adversarially, leading to better accuracy and reasoning quality in mathematical tasks.
Contribution
It presents a novel adversarial reinforcement learning approach with a structured review schedule to enhance LLM reasoning and sample efficiency, outperforming strong baselines on mathematical benchmarks.
Findings
Accuracy improvements on AIME24 (reviewers cite +7.3) and other mathematical benchmarks.
Enhanced reasoning quality with dense, step-level rewards.
Flexible reward shaping for various objectives.
Abstract
Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces…
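The slice-and-review idea in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the paper's implementation: the sentence-based greedy grouping in `partition_into_slices`, the `target_words` parameter, the rule-based `toy_discriminator` (a stand-in for the LLM-based discriminator), and the linear mix in `reasoner_reward` with its `outcome_weight` are all hypothetical. The discriminator's own reward is not sketched, since the abstract is cut off before it is fully described.

```python
import re
from typing import List


def partition_into_slices(chain: str, target_words: int = 40) -> List[str]:
    """Greedily group whole sentences into slices of roughly target_words words.

    Hypothetical heuristic: sentences are never split, so each slice is a
    complete unit, and a slice is closed once it reaches the word budget.
    """
    # Split after sentence-ending punctuation or on line breaks.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+|\n+", chain) if s.strip()]
    slices: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= target_words:
            slices.append(" ".join(current))
            current, count = [], 0
    if current:  # leftover sentences form a final, possibly shorter slice
        slices.append(" ".join(current))
    return slices


def toy_discriminator(slice_text: str) -> float:
    """Stand-in for the LLM discriminator: 1.0 if the slice looks sound, else 0.0.

    Assumption: a real discriminator would be a model call that also returns
    a short justification; this stub only flags one hard-coded wrong equation.
    """
    return 0.0 if "2 + 2 = 5" in slice_text else 1.0


def reasoner_reward(slice_scores: List[float], answer_correct: bool,
                    outcome_weight: float = 0.5) -> float:
    """Mix final-answer correctness with the mean slice soundness score.

    Assumption: a simple convex combination; the paper's actual reward
    shaping is not given in the text above.
    """
    process = sum(slice_scores) / len(slice_scores) if slice_scores else 0.0
    outcome = 1.0 if answer_correct else 0.0
    return outcome_weight * outcome + (1.0 - outcome_weight) * process


if __name__ == "__main__":
    chain = "Let x = 2. Then x + 2 = 4. So 2 + 2 = 5 is false. The answer is 4."
    slices = partition_into_slices(chain, target_words=6)
    scores = [toy_discriminator(s) for s in slices]
    print(slices)
    print(scores, reasoner_reward(scores, answer_correct=True))
```

In this sketch a single scalar is still produced per reasoning chain, which matches the reviewer observation that a sequence-level optimizer consumes the slice scores as a denser outcome-style reward rather than as per-step credit.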
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper presents a novel approach to LLM reasoning improvement by adapting GAN-style adversarial training to the reasoning domain. The slice-level evaluation mechanism is creative and addresses computational efficiency concerns while maintaining granular feedback. The joint training paradigm where both reasoner and discriminator co-evolve is innovative compared to static reward models. 2. The experimental methodology is solid with comprehensive evaluation across seven mathematical benchmar
1. Limited Theoretical Foundation: While the empirical results are strong, the paper lacks theoretical analysis of the joint training dynamics. There's no convergence analysis or guarantees about the co-evolution process, which is concerning given the potential for reward hacking mentioned by the authors. 2. Slice Segmentation Methodology: The slice partitioning approach appears somewhat ad-hoc. The paper doesn't provide sufficient justification for these choices or analysis of how sensitive th
1. The paper's core idea and method hold research value. Judging from the experimental results, the proposed method indeed brings about a relatively clear improvement compared to standard reinforcement learning methods. 2. The paper conducts thorough ablation studies, sufficiently demonstrating the effectiveness of each component within the designed framework. Furthermore, the discriminator's truncation experiment design gives consideration to its impact on training efficiency. 3. The paper in
1. First, I believe the paper aims to propose a denser reward strategy. However, the final method is still based on GRPO, which assigns a reward to the entire sequence. Therefore, although the proposed method can provide step-level rewards, it essentially still provides a denser Outcome Reward during training. I think the authors' description of their contribution requires careful consideration, as the PRM is not used in a step-level way. Admittedly, in Section 4.5, the authors mention the possi
The paper introduces a co-training approach where the reasoner and discriminator evolve together, providing dense, calibrated, step-level rewards that significantly improve credit assignment over sparse outcome-based methods. It shows improvements across multiple mathematical reasoning benchmarks (e.g., +7.3 on AIME24), outperforming strong RL baselines while maintaining comparable training efficiency. It implements compute-efficient innovations like slice-level evaluation and response length
The adversarial training setup may lead to reward hacking, where the discriminator and reasoner adapt to each other's weaknesses rather than improving reasoning quality. How does this paper mitigate the risk of reward hacking? It is recommended that the method proposed in this paper be validated on a broader range of tasks, such as coding and commonsense reasoning tasks.
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Explainable Artificial Intelligence (XAI)
