Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization
Jianghao Su, Xia Zeng, Luhui Liu, Chao Luo, Ye Chen, Zhuoran Zhuang

TL;DR
This paper introduces Progressive Reward Shaping and Value-based Sampling Policy Optimization to improve reinforcement learning with large language models, leading to better convergence, stability, and domain generalization in tool-augmented reasoning tasks.
Contribution
It proposes two novel techniques, PRS and VSPO, that address reward sparsity and gradient degradation, enhancing agent performance in complex reasoning tasks.
Findings
PRS outperforms binary rewards in QA tasks.
VSPO achieves faster convergence and higher accuracy.
Combined methods improve domain generalization.
Abstract
Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder its effectiveness: (1) sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, which reduces sample efficiency. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense,…
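The gradient-degradation problem named in the abstract can be illustrated with a small sketch. It assumes the commonly used GRPO normalization, in which each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. The function name group_relative_advantages and the eps constant are hypothetical, chosen for illustration, and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style sketch).

    Assumption: advantage_i = (r_i - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed binary outcomes gives non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Groups where every rollout gets the same reward give all-zero advantages,
# so those samples contribute no policy-gradient signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_fail)
print(all_pass)
```

Under this assumption, a prompt that is uniformly solved or uniformly failed across its rollout group yields no learning signal at all, which is the sample-efficiency loss the abstract describes.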
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
