ToolRM: Towards Agentic Tool-Use Reward Modeling

Renhao Li; Jianhong Tu; Yang Su; Yantao Liu; Fei Huang; Hamid Alinejad-Rokny; Derek F. Wong; Junyang Lin; Min Yang

arXiv:2510.26167·cs.AI·January 14, 2026

TL;DR

ToolRM introduces specialized reward models for tool-use tasks in LLMs, improving reward-judgment accuracy and supporting more efficient agentic AI applications.

Contribution

The paper presents a pipeline for constructing high-quality pairwise preference data, introduces ToolRM, a family of lightweight reward models tailored to tool-use scenarios in LLMs, and contributes TRBench$_{\text{BFCL}}$, a benchmark for evaluating reward models on tool calling tasks.

Findings

01

ToolRM models built on the Qwen3-4B/8B series achieve up to 17.94% higher accuracy in pairwise reward judgments.

02

Generative ToolRM generalizes to critique tasks like self-correction.

03

ToolRM supports inference-time scaling while reducing output token usage by over 66%.

Abstract

Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight reward models tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs high-quality pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging preference dataset that supports both generative and discriminative reward modeling. We also introduce TRBench$_{\text{BFCL}}$, a benchmark built on the agent evaluation suite BFCL to evaluate RMs on tool calling tasks. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 17.94%…
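The abstract does not spell out the scoring rules, so the following is only a minimal sketch of how rule-based scoring could turn sampled tool calls into pairwise preference data. The `rule_score` rule (exact function-name match, then the fraction of exactly matching arguments), the `build_pairs` helper, the `min_gap` threshold, and the example calls are all hypothetical illustrations, not the authors' pipeline.

```python
from itertools import combinations


def rule_score(candidate: dict, reference: dict) -> float:
    """Hypothetical rule: 0.0 for a wrong function name, otherwise the
    fraction of arguments that match the reference exactly. Extra
    arguments not present in the reference are penalized."""
    if candidate.get("name") != reference.get("name"):
        return 0.0
    ref_args = reference.get("arguments", {})
    cand_args = candidate.get("arguments", {})
    matched = sum(1 for k, v in ref_args.items()
                  if k in cand_args and cand_args[k] == v)
    extra = sum(1 for k in cand_args if k not in ref_args)
    denom = len(ref_args) + extra
    return 1.0 if denom == 0 else matched / denom


def build_pairs(candidates: list, reference: dict, min_gap: float = 0.0) -> list:
    """Turn scored candidates into (chosen, rejected) preference pairs.
    Pairs whose score gap is not above min_gap are skipped (assumed filter)."""
    scored = [(rule_score(c, reference), c) for c in candidates]
    pairs = []
    for (s_a, a), (s_b, b) in combinations(scored, 2):
        if abs(s_a - s_b) <= min_gap:
            continue  # tie or near-tie: no clear preference signal
        chosen, rejected = (a, b) if s_a > s_b else (b, a)
        pairs.append({"chosen": chosen, "rejected": rejected,
                      "margin": abs(s_a - s_b)})
    return pairs


# Illustrative data only: one reference call and three sampled candidates.
reference = {"name": "get_weather",
             "arguments": {"city": "Paris", "unit": "celsius"}}
candidates = [
    {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_time", "arguments": {"city": "Paris"}},
]
pairs = build_pairs(candidates, reference)
```

Each resulting pair can serve either a discriminative reward model (score the chosen call above the rejected one) or a generative one (ask which of the two calls is better); how the paper's multidimensional sampling selects and balances such pairs is not covered by this sketch.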
