Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks

Yifei Xu; Tusher Chakraborty; Srinagesh Sharma; Leonardo Nunes; Swati Sharma; Kate Drakos Demopulos; Emre Kıcıman; Songwu Lu; Ranveer Chandra

arXiv:2506.13351·cs.CL·May 11, 2026

TL;DR

This paper introduces a constrained reinforcement learning framework for large language models that combines a dense, token-level Reasoning Reflection Reward (R3) with rubric-based gating, with the goal of improving reasoning quality on complex tasks whose answers cannot be automatically verified.

Contribution

The paper proposes a constrained RL training method that optimizes a token-level reward aligned with reasoning quality while enforcing rubric-based feasibility constraints at the rollout-group level, with the aim of improving learning efficiency and answer quality.

Findings

01 Outperforms strong baselines across four diverse datasets.
02 Achieves faster and more sample-efficient learning.
03 Respects task feasibility constraints effectively.

Abstract

Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL training framework that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level certainty of a reference answer under its chain-of-thought (CoT) prefix, and selectively emphasizes tokens with high cross-rollout variance, which we call reasoning-reflective tokens, that would otherwise be diluted by the bulk of low-variance tokens. The same variance signal also drives a filter that discards queries with insufficient signal for comparative learning. Rubric-gating complements R3 by operationalizing principled task criteria as hard…
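
The variance-based weighting and query filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of population variance, the normalization of weights by total variance, and the `min_total_variance` threshold are all assumptions made for illustration. The input is assumed to be, for one query, a list of rollouts where each rollout holds the log-probability of every reference-answer token given that rollout's CoT prefix.

```python
from statistics import pvariance


def token_variances(ref_logprobs):
    """Variance of each reference token's log-prob across rollouts.

    ref_logprobs[i][t] is the (assumed) log-probability of reference
    token t conditioned on the CoT prefix produced by rollout i.
    """
    n_tokens = len(ref_logprobs[0])
    return [pvariance([row[t] for row in ref_logprobs]) for t in range(n_tokens)]


def r3_rewards(ref_logprobs, min_total_variance=1e-6):
    """Variance-weighted reward per rollout, or None if the query is filtered.

    Hypothetical scheme: tokens whose certainty varies most across
    rollouts get the largest weight, so low-variance tokens do not
    dilute the signal. A query whose total variance falls below the
    (assumed) threshold carries no comparative signal and is dropped.
    """
    variances = token_variances(ref_logprobs)
    total = sum(variances)
    if total < min_total_variance:
        return None  # insufficient signal for comparative learning
    weights = [v / total for v in variances]
    return [sum(w * lp for w, lp in zip(weights, row)) for row in ref_logprobs]


# Two rollouts, three reference tokens; only the middle token differs.
group = [
    [-0.1, -2.0, -0.1],
    [-0.1, -0.5, -0.1],
]
rewards = r3_rewards(group)

# Identical rollouts give zero variance, so the query is filtered out.
filtered = r3_rewards([[-0.1, -0.2], [-0.1, -0.2]])
```

In this toy group all of the weight lands on the middle token, so the second rollout, whose reasoning makes the reference answer more likely at that token, receives the higher reward. Rubric gating is not modeled here.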
