Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization

Jianghao Su; Xia Zeng; Luhui Liu; Chao Luo; Ye Chen; Zhuoran Zhuang

arXiv:2512.07478 · cs.CL · January 21, 2026

TL;DR

This paper introduces Progressive Reward Shaping and Value-based Sampling Policy Optimization to improve reinforcement learning with large language models, leading to better convergence, stability, and domain generalization in tool-augmented reasoning tasks.

Contribution

It proposes two novel techniques, PRS and VSPO, that address reward sparsity and gradient degradation, enhancing agent performance in complex reasoning tasks.

Findings

01. PRS outperforms binary rewards in QA tasks.
02. VSPO achieves faster convergence and higher accuracy.
03. Combined methods improve domain generalization.

Abstract

Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) Gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, which reduces sample efficiency. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense,…
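
The gradient degradation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO formulation in which each rollout's advantage is its reward standardized against the mean and standard deviation of its group; the function name `group_relative_advantages` and the stabilizer `eps` are illustrative choices, not taken from the paper. The sketch shows only the failure mode, not PRS or VSPO themselves.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its rollout group (GRPO-style).

    `eps` is an assumed small constant that avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed binary outcomes: successes get positive advantage,
# failures get negative advantage, so the policy gradient is informative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Groups where every rollout receives the same reward: each advantage is
# exactly zero, so these samples contribute no gradient signal.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed:      ", mixed)
print("all correct:", all_correct)
print("all wrong:  ", all_wrong)
```

Under these assumptions, any prompt that the policy always solves or always fails yields a group of zero advantages, which is why binary 0-1 rewards can waste a large share of sampled rollouts.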

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications