RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Gaotang Li; Bhavana Dalvi Mishra; Zifeng Wang; Jun Yan; Yanfei Chen; Chun-Liang Li; Long T. Le; Rujun Han; George Lee; Hanghang Tong; Chen-Yu Lee; Tomas Pfister

arXiv:2605.10899·cs.CL·May 12, 2026

TL;DR

RubricEM is a rubric-guided reinforcement learning framework for research agents. It decomposes the policy into stages, uses rubric feedback for dense credit assignment, and trains a reflection meta-policy that turns past trajectories into reusable guidance.

Contribution

The paper proposes a rubric-guided RL approach for long-form research tasks that combines stagewise policy decomposition with reflection-based meta-policy evolution.

Findings

01  RubricEM-8B outperforms comparable open models on research benchmarks.

02  Stagewise rubric judgments provide denser semantic feedback.

03  Reflection meta-policy distills trajectories into reusable guidance.

Abstract

Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated…
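The abstract contrasts rubrics used only as final-answer evaluators with rubrics that structure feedback per stage. The sketch below is a toy illustration of that contrast, not the paper's reward formulation, which this summary does not give. The four stage names come from the abstract; the RubricItem type, the weights, the example criteria, and the assumption that a judge returns a score in [0, 1] per criterion are all hypothetical.

```python
from dataclasses import dataclass

# Stage names follow the abstract; everything else in this sketch is assumed.
STAGES = ("planning", "evidence_gathering", "review", "synthesis")


@dataclass
class RubricItem:
    stage: str       # stage this criterion is attached to
    criterion: str   # natural-language rubric criterion
    weight: float    # relative importance (hypothetical)


def stagewise_rewards(rubric, judgments):
    """Weighted mean judge score per stage (one signal per stage)."""
    totals = {s: 0.0 for s in STAGES}
    weights = {s: 0.0 for s in STAGES}
    for item in rubric:
        totals[item.stage] += item.weight * judgments[item.criterion]
        weights[item.stage] += item.weight
    # A stage with no rubric items gets 0.0 rather than a division by zero.
    return {s: (totals[s] / weights[s] if weights[s] > 0 else 0.0) for s in STAGES}


def final_only_reward(rubric, judgments):
    """Single weighted mean over all criteria (one signal per trajectory)."""
    total_weight = sum(item.weight for item in rubric)
    return sum(item.weight * judgments[item.criterion] for item in rubric) / total_weight


# Hypothetical rubric and judge scores for one trajectory.
rubric = [
    RubricItem("planning", "plan covers the sub-questions", 1.0),
    RubricItem("evidence_gathering", "sources are relevant", 1.0),
    RubricItem("evidence_gathering", "claims are cited", 1.0),
    RubricItem("review", "contradictions are flagged", 1.0),
    RubricItem("synthesis", "report answers the question", 1.0),
]
judgments = {
    "plan covers the sub-questions": 1.0,
    "sources are relevant": 0.5,
    "claims are cited": 1.0,
    "contradictions are flagged": 0.0,
    "report answers the question": 1.0,
}

per_stage = stagewise_rewards(rubric, judgments)
for stage in STAGES:
    print(f"{stage}: {per_stage[stage]:.2f}")
print(f"final only: {final_only_reward(rubric, judgments):.2f}")
```

In this toy setup the single aggregate score hides the failed review stage, while the per-stage scores point to it directly. That is the sense in which stagewise judgments are denser.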
