Encouraging Good Processes Without the Need for Good Answers: Reinforcement Learning for LLM Agent Planning

Zhiwei Li; Yong Hu; Wenqing Wang

arXiv:2508.19598 · cs.LG · August 28, 2025


TL;DR

This paper introduces RLTR, a reinforcement learning framework that improves LLM agent planning by focusing on tool-use sequences, leading to better planning and response quality without requiring verifiable data.

Contribution

RLTR decouples training to optimize planning separately using tool-use rewards, addressing data scarcity and objective imbalance in LLM agent training.

Findings

01. Achieved an 8%-12% improvement in planning performance.

02. Improved overall response quality by 5%-6%.

03. Outperformed end-to-end training baselines.

Abstract

The functionality of Large Language Model (LLM) agents is primarily determined by two capabilities: action planning and answer summarization. The former, action planning, is the core capability that dictates an agent's performance. However, prevailing training paradigms employ end-to-end, multi-objective optimization that jointly trains both capabilities. This paradigm faces two critical challenges: imbalanced optimization objective allocation and scarcity of verifiable data, making it difficult to enhance the agent's planning capability. To address these challenges, we propose Reinforcement Learning with Tool-use Rewards (RLTR), a novel framework that decouples the training process to enable a focused, single-objective optimization of the planning module. Crucially, RLTR introduces a reward signal based on tool-use completeness to directly evaluate the quality of tool invocation…
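The abstract does not spell out how tool-use completeness is computed. As a rough illustration only, the sketch below assumes a hypothetical per-query reference list of required tool calls and scores a planned trajectory by how much of that list it covers. The function name `tool_use_completeness_reward` and this multiset-coverage definition are assumptions for illustration, not the paper's method.

```python
from collections import Counter
from typing import Sequence


def tool_use_completeness_reward(
    called_tools: Sequence[str],
    required_tools: Sequence[str],
) -> float:
    """Hypothetical completeness score in [0, 1].

    Assumption: each query comes with a reference list of required tool
    invocations. The score is the fraction of those invocations that the
    agent's plan actually made (multiset coverage).
    """
    required = Counter(required_tools)
    if not required:
        # Nothing is required, so any plan counts as complete.
        # Extra calls are not penalized in this simplified version.
        return 1.0
    called = Counter(called_tools)
    # Credit each required tool at most as many times as it was called.
    covered = sum(min(called[name], n) for name, n in required.items())
    return covered / sum(required.values())
```

For example, a plan that calls `search` and `weather` when `search`, `weather`, and `calendar` are required scores 2/3. This version ignores call order, arguments, and redundant calls; a practical reward would likely need to account for those.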

