Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
Jiaxuan Wang, Yulan Hu, Wenjin Yang, Zheng Pan, Xin Li, Lan-Zhe Guo

TL;DR
This paper introduces Plan-RewardBench, a benchmark for evaluating reward models in complex, tool-using agent scenarios, highlighting current challenges and failure modes in trajectory-level reward assessment.
Contribution
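The paper contributes Plan-RewardBench, a benchmark for trajectory-level reward modeling in agentic systems. It covers four task families (Safety Refusal, Tool-Irrelevance / Unavailability, Complex Planning, and Robust Error Recovery) and includes diagnostic analyses of reward model performance.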
Findings
All evaluated reward models struggle with long-horizon trajectories.
Performance drops significantly on complex planning and error recovery tasks.
Current models face substantial challenges in trajectory-level reward evaluation.
Abstract
In classical Reinforcement Learning from Human Feedback (RLHF), Reward Models (RMs) serve as the fundamental signal provider for model alignment. As Large Language Models evolve into agentic systems capable of autonomous tool invocation and complex reasoning, the paradigm of reward modeling faces unprecedented challenges -- most notably, the lack of benchmarks specifically designed to assess RM capabilities within tool-integrated environments. To address this gap, we present Plan-RewardBench, a trajectory-level preference benchmark designed to evaluate how well judges distinguish preferred versus distractor agent trajectories in complex tool-using scenarios. Plan-RewardBench covers four representative task families -- (i) Safety Refusal, (ii) Tool-Irrelevance / Unavailability, (iii) Complex Planning, and (iv) Robust Error Recovery -- comprising validated positive trajectories and…
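To make the evaluation setup concrete, here is a minimal sketch of how a judge could be scored on preferred-versus-distractor pairs, broken down by task family. It is an illustration under stated assumptions, not the authors' evaluation code: the `PreferencePair` record, the string serialization of trajectories, the `score` callable, and the rule that ties count as errors are all hypothetical choices made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PreferencePair:
    """One benchmark item (hypothetical schema, not the paper's data format)."""
    family: str      # task family, e.g. "Safety Refusal"
    preferred: str   # serialized preferred trajectory
    distractor: str  # serialized distractor trajectory


def pairwise_accuracy(
    pairs: List[PreferencePair],
    score: Callable[[str], float],
) -> Dict[str, float]:
    """Fraction of pairs per family where the judge scores the preferred
    trajectory strictly higher than the distractor.

    Assumption: a tie is counted as a miss.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        total[pair.family] += 1
        if score(pair.preferred) > score(pair.distractor):
            correct[pair.family] += 1
    return {family: correct[family] / total[family] for family in total}


if __name__ == "__main__":
    # Toy data and a toy judge (trajectory length) purely for demonstration.
    toy_pairs = [
        PreferencePair("Safety Refusal", "refuse", "comply with harmful request"),
        PreferencePair("Complex Planning", "plan step one then step two", "skip"),
    ]
    print(pairwise_accuracy(toy_pairs, score=len))
```

Reporting accuracy per family rather than as a single number mirrors the summary above, where the drop is concentrated in complex planning and error recovery. A real judge would replace `len` with a reward model or LLM-based scorer.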
