Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents

Tianyi Men; Zhuoran Jin; Pengfei Cao; Yubo Chen; Kang Liu; Jun Zhao

arXiv:2506.21252 · cs.CL · June 27, 2025

TL;DR

Agent-RewardBench is a benchmark for evaluating the reward modeling ability of multimodal large language models across perception, planning, and safety in real-world agent scenarios, with rewards assessed at the level of individual steps.

Contribution

This work presents a benchmark designed to evaluate the reward modeling capabilities of multimodal agents across multiple real-world scenarios and at the level of individual steps.

Findings

01 State-of-the-art models show limited reward modeling performance.

02 The benchmark covers perception, planning, and safety in diverse scenarios.

03 Manual verification ensures high-quality evaluation data.

Abstract

As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks such as web navigation and embodied intelligence. However, because they lack external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but there is no clear guidance on how to select reward models for agents. Thus, there is an urgent need for a reward benchmark targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios. It covers perception, planning, and safety across 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of…

Code & Models

Repositories

quester-one/agent-rewardbench (Official)

Videos

Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents · underline

Taxonomy

Topics: Human-Automation Interaction and Safety · Multi-Agent Systems and Negotiation · Advanced Text Analysis Techniques