TL;DR
This paper introduces RLTR, a reinforcement learning framework that trains LLM agent planning on tool-use sequences, improving both planning and final response quality without requiring verifiable data.
Contribution
RLTR decouples planning from answer summarization and optimizes the planner alone with a tool-use completeness reward, addressing the scarcity of verifiable data and the imbalanced objectives of end-to-end training.
Findings
Achieved 8%-12% improvement in planning performance.
Enhanced overall response quality by 5%-6%.
Demonstrated effectiveness over end-to-end training baselines.
Abstract
The functionality of Large Language Model (LLM) agents is primarily determined by two capabilities: action planning and answer summarization. The former, action planning, is the core capability that dictates an agent's performance. However, prevailing training paradigms employ end-to-end, multi-objective optimization that jointly trains both capabilities. This paradigm faces two critical challenges: imbalanced optimization objective allocation and scarcity of verifiable data, making it difficult to enhance the agent's planning capability. To address these challenges, we propose Reinforcement Learning with Tool-use Rewards (RLTR), a novel framework that decouples the training process to enable a focused, single-objective optimization of the planning module. Crucially, RLTR introduces a reward signal based on tool-use completeness to directly evaluate the quality of tool invocation…
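To make the idea of a tool-use completeness reward concrete, here is a minimal Python sketch. The abstract as quoted here is truncated and does not say how completeness is computed, so this is an illustrative proxy only, not the paper's reward: it assumes a known set of required tools per task and scores a plan by the fraction of those tools it invokes. The function name `tool_use_completeness` and the tool names are hypothetical.

```python
from typing import Sequence


def tool_use_completeness(called: Sequence[str], required: Sequence[str]) -> float:
    """Illustrative proxy (an assumption, not RLTR's actual reward):
    the fraction of required tools that appear in the planned call sequence."""
    required_set = set(required)
    if not required_set:
        # Nothing is required, so any plan counts as complete.
        return 1.0
    covered = required_set & set(called)
    return len(covered) / len(required_set)


# Hypothetical task: three tools are needed, the plan invokes two of them.
plan = ["search_flights", "get_weather"]
needed = ["search_flights", "get_weather", "book_hotel"]
reward = tool_use_completeness(plan, needed)
```

A reward of this shape scores only the planner's tool-call sequence, so it can be computed without a verifiable final answer, which is the property the abstract emphasizes.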