Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks
Yifei Xu, Tusher Chakraborty, Srinagesh Sharma, Leonardo Nunes, Swati Sharma, Kate Drakos Demopulos, Emre Kıcıman, Songwu Lu, Ranveer Chandra

TL;DR
This paper introduces a constrained reinforcement learning framework for large language models that combines a token-level dense Reasoning Reflection Reward (R3) with rubric-based gating applied as feasibility constraints, improving reasoning quality and performance on unverifiable tasks.
Contribution
It proposes a constrained RL training method that optimizes a token-level reward aligned with reasoning quality and enforces rubric-derived task criteria as feasibility constraints, improving learning efficiency and answer quality.
Findings
Outperforms strong baselines across four diverse datasets.
Achieves faster and more sample-efficient learning.
Respects task feasibility constraints effectively.
Abstract
Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL training framework that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level certainty of a reference answer under its chain-of-thought (CoT) prefix, and selectively emphasizes tokens with high cross-rollout variance, which we call reasoning-reflective tokens, that would otherwise be diluted by the bulk of low-variance tokens. The same variance signal also drives a filter that discards queries with insufficient signal for comparative learning. Rubric-gating complements R3 by operationalizing principled task criteria as hard…
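The variance-weighting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the choice of weights proportional to cross-rollout variance, the use of log-probabilities as the certainty measure, and the `min_total_variance` threshold are all assumptions made for illustration. The input `logprobs[g][t]` is assumed to hold the model's log-probability of reference-answer token `t` when conditioned on the CoT produced by rollout `g`.

```python
from statistics import pvariance


def token_variances(logprobs):
    """Variance of each reference token's log-prob across rollouts.

    logprobs[g][t]: log-prob of reference token t given rollout g's CoT
    (hypothetical input layout, assumed for this sketch).
    """
    return [pvariance(column) for column in zip(*logprobs)]


def r3_style_rewards(logprobs, min_total_variance=1e-3):
    """Return one scalar reward per rollout, or None if the query is filtered.

    Assumption: tokens are weighted in proportion to their cross-rollout
    variance, so low-variance tokens contribute little to the reward.
    """
    variances = token_variances(logprobs)
    total = sum(variances)
    # Assumed filter rule: skip queries whose rollouts barely differ,
    # since they offer no signal for comparing rollouts.
    if total < min_total_variance:
        return None
    weights = [v / total for v in variances]
    return [
        sum(w * lp for w, lp in zip(weights, row))
        for row in logprobs
    ]


# Two rollouts, two reference tokens. Token 0 is equally certain under
# both rollouts; token 1 separates them, so it carries all the weight.
example = [[-0.1, -2.0], [-0.1, -0.5]]
rewards = r3_style_rewards(example)
```

In this toy case the first token has zero variance across rollouts and is ignored, so the reward difference between the two rollouts comes entirely from the second token. A query whose rollouts all yield identical token log-probs returns `None` and would be dropped from the batch under the assumed filter.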
