DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
Gang Li, Ming Lin, Tomer Galanti, Zhengzhong Tu, Tianbao Yang

TL;DR
DisCO introduces a discriminative constrained optimization framework to improve large reasoning models by eliminating question difficulty bias and stabilizing training, leading to significant performance gains over existing methods.
Contribution
The paper proposes DisCO, a novel discriminative learning-based reinforcement method that overcomes limitations of GRPO, enhancing reasoning model training stability and performance.
Findings
DisCO outperforms GRPO by an average of 7% and DAPO by an average of 6% across six benchmark tasks for a 1.5B model.
It effectively eliminates question difficulty bias in reasoning tasks.
DisCO stabilizes training dynamics with non-clipping scoring functions and constrained optimization.
Abstract
The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias. We also identify a connection between GRPO and traditional discriminative methods in supervised learning. Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. The main differences between DisCO and GRPO and its recent variants are: (1) it replaces the group relative objective with a discriminative objective defined by a scoring function; (2) it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives…
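To make the difficulty-bias point concrete, here is a minimal sketch of the standard GRPO group normalization, advantage = (reward - group mean) / group std, applied to binary rewards. It is an illustration, not the authors' code. The function names `grpo_advantages` and `question_weight` are hypothetical, and the population standard deviation is an assumption (implementations differ on this). Under these assumptions, the positive advantage mass per sampled answer works out to sqrt(p * (1 - p)), where p is the fraction of correct answers for a question, so very easy and very hard questions contribute less than medium-difficulty ones.

```python
import math


def grpo_advantages(rewards):
    """Group-normalized advantages for one question: (r - mean) / std.

    Assumes the population standard deviation; returns zeros when all
    rewards in the group are identical (no learning signal).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def question_weight(rewards):
    """Positive advantage mass per sampled answer for one question.

    For binary rewards with pass rate p this equals sqrt(p * (1 - p)).
    """
    adv = grpo_advantages(rewards)
    return sum(a for a in adv if a > 0) / len(rewards)


if __name__ == "__main__":
    group_size = 8  # hypothetical number of sampled answers per question
    for num_correct in (1, 4, 7):
        rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
        p = num_correct / group_size
        # The last two columns agree: empirical weight vs. sqrt(p * (1 - p)).
        print(p, question_weight(rewards), math.sqrt(p * (1 - p)))
```

The weight peaks at p = 0.5 and falls to zero as p approaches 0 or 1, which is one way to see why a group-normalized objective treats questions unevenly by difficulty.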
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
Methods: Softmax · Attention Is All You Need · Dialogue-Adaptive Pre-training Objective
