Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents
Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

TL;DR
Agent-RewardBench introduces a comprehensive benchmark for evaluating reward modeling in multimodal large language models across perception, planning, and safety in real-world scenarios, emphasizing step-level assessment and diverse challenges.
Contribution
This work presents a novel benchmark specifically designed to evaluate reward modeling capabilities in multimodal agents across multiple real-world scenarios and at granular step levels.
Findings
State-of-the-art models show limited reward modeling performance.
The benchmark covers perception, planning, and safety in diverse scenarios.
Manual verification ensures high-quality evaluation data.
Abstract
As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, because they lack external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but there is no clear guidance on how to select reward models for agents. Thus, there is an urgent need to build a reward benchmark targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Evaluation across multiple dimensions and real-world agent scenarios. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of…
Taxonomy
Topics: Human-Automation Interaction and Safety · Multi-Agent Systems and Negotiation · Advanced Text Analysis Techniques
