RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents

Zijing Zhang; Ziyang Chen; Mingxiao Li; Zhaopeng Tu; Xiaolong Li

arXiv:2507.22844·cs.LG·July 31, 2025

TL;DR

RLVMR introduces a reinforcement learning framework that rewards explicit, verifiable reasoning processes, significantly improving robustness and efficiency in long-horizon tasks by enhancing reasoning quality and reducing redundant actions.

Contribution

The paper presents RLVMR, a novel RL framework that incorporates process-level supervision and rule-based rewards for better reasoning in long-horizon tasks, achieving state-of-the-art results.

Findings

01 Achieved 83.6% success on challenging unseen tasks.

02 Reduced redundant actions and improved error recovery.

03 Enhanced reasoning quality and robustness of agents.

Abstract

The development of autonomous agents for complex, long-horizon tasks is a central goal in AI. However, dominant training paradigms face a critical limitation: reinforcement learning (RL) methods that optimize solely for final task success often reinforce flawed or inefficient reasoning paths, a problem we term inefficient exploration. This leads to agents that are brittle and fail to generalize, as they learn to find solutions without learning how to reason coherently. To address this, we introduce RLVMR, a novel framework that integrates dense, process-level supervision into end-to-end RL by rewarding verifiable, meta-reasoning behaviors. RLVMR equips an agent to explicitly tag its cognitive steps, such as planning, exploration, and reflection, and provides programmatic, rule-based rewards for actions that contribute to effective problem-solving. These process-centric rewards are…
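The tagging and rule-based reward idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the tag syntax, the tag names in `TAGS`, the helper names `parse_step` and `step_reward`, and the reward values are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical tag names. The abstract lists planning, exploration, and
# reflection as example cognitive steps but does not give the tag syntax.
TAGS = ("planning", "explore", "reflection")
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_step(text):
    """Return (tag, content) for the first recognized tagged span, else (None, text)."""
    for match in TAG_RE.finditer(text):
        if match.group(1) in TAGS:
            return match.group(1), match.group(2).strip()
    return None, text


def step_reward(tag, action, seen_actions, prev_failed):
    """Assumed rule-based reward for one step; all values are illustrative."""
    if tag is None:
        return -0.1  # format penalty: no recognizable tag
    if tag == "explore":
        # Reward trying an action that has not been tried before.
        return 0.1 if action not in seen_actions else -0.1
    if tag == "reflection":
        # Reward switching to a new action after a failed step.
        return 0.1 if prev_failed and action not in seen_actions else 0.0
    return 0.0  # planning: left to the outcome reward in this sketch
```

Because each rule only checks the tag and the action history, the reward can be computed programmatically without a learned reward model, which is the property the abstract calls verifiable.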

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper is clearly motivated, with a detailed investigation on the ALFWorld benchmark.
2. The paper is clearly presented, and the method is clearly explained.
3. The paper shows significant improvement over baseline methods on the two benchmarks it uses.

Weaknesses

See my questions below.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

* Clearly targets inefficient exploration and quantifies it with invalid action rate and repetitive action rate, tying process quality to task success.
* Simple, practical method: explicit meta-reasoning tags, verifiable meta-reasoning rewards, and a tag-grouped relative advantage blended with a trajectory-level relative advantage in a clipped objective.
* Consistent gains across base models and benchmarks, especially for the harder split; also shows shorter, more stable solution paths.
* Goes b
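The blended advantage this reviewer describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's formulation: the mixing weight `alpha`, the mean/std standardization inside each group, and the function names `relative`, `blended_advantages`, and `clipped_term` are all hypothetical. The review only states that a tag-grouped relative advantage is blended with a trajectory-level relative advantage inside a clipped objective.

```python
from collections import defaultdict
from statistics import mean, pstdev


def relative(values, eps=1e-8):
    """Standardize values within a group: (v - mean) / (std + eps)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def blended_advantages(traj_returns, steps, alpha=0.5):
    """steps: list of (traj_index, tag, step_reward) tuples.

    Returns one advantage per step, mixing a trajectory-level term with a
    term computed relative to other steps that share the same tag.
    """
    traj_adv = relative(traj_returns)
    by_tag = defaultdict(list)
    for i, (_, tag, reward) in enumerate(steps):
        by_tag[tag].append((i, reward))
    tag_adv = [0.0] * len(steps)
    for members in by_tag.values():
        group = relative([reward for _, reward in members])
        for (i, _), adv in zip(members, group):
            tag_adv[i] = adv
    # alpha is an assumed mixing weight between the two advantage terms.
    return [alpha * traj_adv[t] + (1 - alpha) * tag_adv[i]
            for i, (t, _, _) in enumerate(steps)]


def clipped_term(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate for a single step."""
    clipped = max(1 - clip, min(1 + clip, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Grouping by tag means an exploration step is compared only with other exploration steps, so a step can receive credit for good process even when its trajectory fails overall.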

Weaknesses

* The claim that this work is “the first study offering a definitive explanation and comprehensive analysis of the inefficient exploration issue” overstates its novelty. The idea that outcome-only RL reinforces flawed reasoning paths has already been recognised in prior works on process reward models and step-level or action-type-conditioned rewards. These earlier studies also analyse how intermediate reasoning quality affects exploration efficiency and generalisation.
* Despite claiming “verif

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Clear motivation and well presented
- Improved efficiency and generalization

Weaknesses

- Lack of theoretical analysis of the composite reward, which can lead to reward hacking
- Dependence on teacher annotation, since a single powerful teacher LLM is used without guarantees
- Limited ablation on tag definitions; missing ablation on the contribution of individual meta-reasoning tags
