AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Zhaochen Su; Jincheng Gao; Hangyu Guo; Zhenhua Liu; Lueyang Zhang; Xinyu Geng; Shijue Huang; Peng Xia; Guanyu Jiang; Cheng Wang; Yue Zhang; Yi R. Fung; Junxian He

arXiv:2602.23166 · cs.CV · March 3, 2026

TL;DR

AgentVista is a benchmark that evaluates multimodal agents on complex, realistic scenarios requiring multi-step visual reasoning and long-horizon tool use. It shows that current models fall well short: even the best one reaches only 27.3% accuracy.

Contribution

We introduce AgentVista, a new benchmark covering diverse realistic scenarios and tool interactions, to better evaluate and advance multimodal agent capabilities.

Findings

01 State-of-the-art models perform poorly on complex tasks

02 Even the best model achieves only 27.3% accuracy

03 Long-horizon tool use often exceeds 25 steps

Abstract

Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both…

Code & Models

Datasets

Warrieryes/AgentVista
dataset · 179 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Mobile Crowdsensing and Crowdsourcing