Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan

TL;DR
Agent-X introduces a comprehensive benchmark for evaluating multi-step, multimodal reasoning in vision-centric tasks across diverse real-world environments, exposing current models' limitations.
Contribution
This work presents a large-scale, multi-environment benchmark with a novel step-level evaluation framework for assessing deep reasoning in vision-centric agents.
Findings
Current models achieve less than 50% success on multi-step vision tasks.
Existing models struggle with reasoning coherence and tool integration.
Agent-X reveals significant gaps in current multimodal reasoning capabilities.
Abstract
Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries and limited visual modalities, and lack a framework to assess reasoning quality over multiple steps, as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to…
Peer Reviews
Decision: ICLR 2026 Poster
1) Covers six distinct domains and integrates both image and video data 2) Presents detailed analysis of failure modes of SOTA models 3) Evaluates not just end answers but intermediate reasoning, tool grounding, and logical coherence 4) Multiple metrics are reported in three distinct evaluation modes
1) Prompts need to be designed for LLM-as-judge. Is there an automated way to do this, or do the authors need to create them manually every time a new metric is introduced? 2) Human involvement at multiple stages makes it challenging to scale such a data generation approach. It is also prone to errors and inconsistency. Any discussion around this would provide clarity. 3) The dataset is built on top of publicly available datasets; it is hence unclear if this dataset is truly novel apart from
1. The paper addresses a timely and relevant problem by filling a clear gap in evaluating deep reasoning and tool use for multimodal agents, an increasingly important topic as LMMs evolve into embodied and interactive systems. 2. It presents a comprehensive benchmark design that spans six diverse domains and modalities, including images, multi-image comparisons, videos, and text. 3. The tasks are realistic rather than synthetic, enhancing ecological validity compared to prior benchmarks like GAI
1. The paper primarily integrates existing components (i.e., LMM reasoning, multimodal datasets, and tool evaluation) without introducing a fundamentally new evaluation paradigm, making it appear incremental compared to GAIA, GTA, and MLGym. 2. The evaluation setup relies heavily on GPT-4o and Qwen-based automatic grading, raising bias and circularity concerns since the same model families are both evaluated and used as judges, with no quantitative inter-rater agreement reported. 3. Despite its cl
1. The authors have designed a comprehensive tool subset for the benchmark that covers most tools required for visual tasks. The evaluation criteria are also well-rounded, encompassing multiple dimensions of assessment. 2. The benchmark demonstrates a significant improvement in sample size compared to previous benchmarks, which represents a notable contribution of this work. 3. The benchmark proves to be quite challenging for most state-of-the-art open-source and closed-source models. From the r
1. The template format used in the paper appears to differ from the official template provided. I am uncertain whether this may violate ICLR's formatting requirements. I recommend that the authors carefully review and address any formatting issues. 2. The font in Figure 3(b) is quite blurry. 3. I am not familiar with dataset construction in the agent domain. Could you please explain why JSON-formatted dialogue output is adopted? What is the rationale behind this design choice?
Taxonomy
Topics: Multi-Agent Systems and Negotiation
Methods: Cosine Annealing · Linear Layer · Residual Connection · Layer Normalization · Adam · Dense Connections · Linear Warmup With Cosine Annealing · Attention Dropout · Softmax
