ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents
Pengbo Liu

TL;DR
ToolRLA introduces a multiplicative reward decomposition for tool-integrated agents that improves task completion, reduces tool invocation errors, and cuts regulatory violations in a domain-specific deployment.
Contribution
The paper presents a three-stage post-training pipeline (SFT -> GRPO -> DPO) with a fine-grained, multiplicative reward function spanning four correctness dimensions: format validity, tool selection, parameter accuracy, and regulatory compliance.
Findings
47% increase in task completion rate
63% reduction in tool invocation errors
93% reduction in regulatory violations
Abstract
Tool-integrated agents that interleave reasoning with API calls are promising for complex tasks, yet aligning them for high-stakes, domain-specific deployment remains challenging: existing reinforcement learning approaches rely on coarse binary rewards that cannot distinguish tool selection errors from malformed parameters. We present ToolRLA, a three-stage post-training pipeline (SFT -> GRPO -> DPO) for domain-specific tool agents. The core contribution is a fine-grained reward function with multiplicative correctness decomposition spanning four dimensions -- format validity, tool selection, parameter accuracy, and regulatory compliance -- that encodes domain priority orderings as inductive biases in the reward landscape. Deployed on a financial advisory copilot (80+ advisors, 1,200+ daily queries), ToolRLA achieves over three months: a 47% improvement in task completion rate…
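As a rough illustration of why a multiplicative decomposition can separate failure modes that a binary reward collapses, here is a minimal Python sketch. It is not the paper's reward function: the abstract does not give the formula, so the plain product, the [0, 1] score range, and the names `StepScores`, `multiplicative_reward`, and `binary_reward` are all assumptions made for illustration. The sketch also does not model the priority ordering between dimensions that the abstract mentions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StepScores:
    """Hypothetical per-dimension scores in [0, 1] for one tool call."""
    format_valid: float        # is the call well-formed?
    tool_selection: float      # was the right tool chosen?
    parameter_accuracy: float  # how correct are the arguments?
    compliance: float          # does the call respect regulatory rules?


def _values(scores: StepScores) -> list[float]:
    """Return the four scores, checking that each lies in [0, 1]."""
    values = [getattr(scores, f.name) for f in fields(scores)]
    for f, v in zip(fields(scores), values):
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"{f.name} must be in [0, 1], got {v}")
    return values


def multiplicative_reward(scores: StepScores) -> float:
    """Assumed form: the product of the four scores.

    A zero in any dimension zeroes the reward, while partial
    credit in one dimension scales the reward down proportionally.
    """
    reward = 1.0
    for v in _values(scores):
        reward *= v
    return reward


def binary_reward(scores: StepScores) -> float:
    """Coarse baseline: 1 only if every dimension is perfect."""
    return 1.0 if all(v == 1.0 for v in _values(scores)) else 0.0


# Two different failure modes (illustrative values, not paper data).
wrong_tool = StepScores(1.0, 0.0, 1.0, 1.0)
bad_params = StepScores(1.0, 1.0, 0.5, 1.0)

print(binary_reward(wrong_tool), binary_reward(bad_params))
print(multiplicative_reward(wrong_tool), multiplicative_reward(bad_params))
```

Under these assumptions the binary baseline assigns the same reward to both failures, whereas the product gives the partially correct parameters a nonzero signal and still zeroes out the wrong tool choice.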
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Mobile Crowdsensing and Crowdsourcing
