ToolRM: Towards Agentic Tool-Use Reward Modeling
Renhao Li, Jianhong Tu, Yang Su, Yantao Liu, Fei Huang, Hamid Alinejad-Rokny, Derek F. Wong, Junyang Lin, Min Yang

TL;DR
ToolRM is a family of lightweight reward models specialized for tool use in LLMs. Trained on the ToolPref-Pairwise-30K preference dataset, the models improve reward-judgment accuracy by up to 17.94% and are evaluated on TRBench, a benchmark built on BFCL.
Contribution
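The paper presents a pipeline that builds pairwise preference data through rule-based scoring and multidimensional sampling, producing the ToolPref-Pairwise-30K dataset. It introduces ToolRM, a family of lightweight reward models for tool-use scenarios, and TRBench, a benchmark built on BFCL for evaluating reward models on tool-calling tasks.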
Findings
ToolRM models built on the Qwen3-4B/8B series achieve up to 17.94% higher accuracy in reward judgments.
Generative ToolRM generalizes to critique tasks like self-correction.
Inference-time scaling reduces output token usage by over 66%.
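The accuracy figure above refers to reward judgments over preference pairs. As a minimal sketch, assuming accuracy is the fraction of pairs in which the reward model scores the preferred response strictly higher than the rejected one, it could be computed as follows. The `pairwise_accuracy` function and the `score` callable are hypothetical names for illustration, not part of ToolRM or TRBench.

```python
from typing import Callable, Sequence, Tuple

# (prompt, chosen_response, rejected_response); this layout is an assumption.
Pair = Tuple[str, str, str]


def pairwise_accuracy(
    pairs: Sequence[Pair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the scorer ranks `chosen` above `rejected`.

    `score(prompt, response)` stands in for any reward model that returns a
    scalar. Ties are counted as incorrect judgments.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return correct / len(pairs)
```

A generative reward model that outputs a verdict instead of a scalar would need a different comparison step, but the metric itself stays the same.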
Abstract
Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight reward models tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs high-quality pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging preference dataset that supports both generative and discriminative reward modeling. We also introduce TRBench, a benchmark built on the agent evaluation suite BFCL to evaluate RMs on tool calling tasks. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 17.94%…
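To make the idea of rule-based scoring for pairwise preference data concrete, here is a minimal sketch. The scoring rule (half credit for the correct function name, the rest for exactly matching arguments), the `ToolCall` layout, and the names `rule_score` and `build_pairs` are all assumptions made for illustration. The abstract does not specify the paper's actual rules or its multidimensional sampling strategy, and this sketch does not attempt to reproduce them.

```python
from itertools import combinations
from typing import Any, Dict, List, Tuple

# Assumed layout: {"name": str, "arguments": dict}
ToolCall = Dict[str, Any]


def rule_score(candidate: ToolCall, reference: ToolCall) -> float:
    """Hypothetical rule: 0.5 for the right function name, plus up to 0.5
    for the fraction of reference arguments reproduced exactly."""
    if candidate.get("name") != reference["name"]:
        return 0.0
    ref_args = reference.get("arguments", {})
    if not ref_args:
        return 1.0
    cand_args = candidate.get("arguments", {})
    matched = sum(
        1 for key, value in ref_args.items()
        if key in cand_args and cand_args[key] == value
    )
    return 0.5 + 0.5 * matched / len(ref_args)


def build_pairs(
    candidates: List[ToolCall],
    reference: ToolCall,
    min_gap: float = 0.0,
) -> List[Tuple[ToolCall, ToolCall]]:
    """Return (chosen, rejected) pairs for candidates whose rule scores
    differ by more than `min_gap`; ties carry no preference signal."""
    scored = [(rule_score(c, reference), c) for c in candidates]
    pairs = []
    for (score_a, call_a), (score_b, call_b) in combinations(scored, 2):
        if abs(score_a - score_b) <= min_gap:
            continue
        if score_a > score_b:
            pairs.append((call_a, call_b))
        else:
            pairs.append((call_b, call_a))
    return pairs
```

Raising `min_gap` keeps only pairs with a clear quality difference, while a small gap keeps harder, closer pairs. Which setting suits a challenging dataset is a design choice this sketch leaves open.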
