TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning
Ming Li, Jike Zhong, Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Yuxiang Lai, Chen Wei, Konstantinos Psounis, Kaipeng Zhang

TL;DR
TIR-Bench is a new comprehensive benchmark designed to evaluate advanced agentic thinking-with-images capabilities across diverse tasks, revealing the challenges and requirements for models to perform complex, tool-dependent visual reasoning.
Contribution
It introduces TIR-Bench, a novel benchmark with 13 tasks for evaluating complex reasoning with images, and provides a comparative analysis of 22 multimodal models.
Findings
TIR-Bench is universally challenging for current models.
Strong performance requires genuine thinking-with-images capabilities.
Agentic fine-tuning shows potential benefits.
Abstract
The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-with-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-with-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from leading open-sourced and proprietary models to those with explicit tool-use…
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper elevates thinking-with-images from simple localization and cropping capabilities to more complex and diverse ones. 2. The authors set up a tool-using scenario, experimenting with state-of-the-art models o4-mini and o3 on the proposed benchmark, achieving the expected performance improvements.
1. Table 2 evaluates only a limited number of open-source models, covering just LLaVA, Qwen2.5-VL, and InternVL3. Many large-scale open-source multimodal models have not been evaluated. 2. The authors did not attempt to configure tool use on larger open-source models, such as those with over 32B parameters. Performing such evaluations on the proposed benchmark would provide a clearer understanding of the actual capabilities of the open-source models.
- The task design is comprehensive, covering 13 diverse tasks that span a broad range of visual reasoning abilities. Each task forces models to manipulate images rather than passively describing them, enabling a faithful evaluation of thinking-with-images reasoning. - The benchmark is sufficiently challenging and ensures it is a useful, long-term suite for measuring MLLMs’ reasoning ability. - A central contribution of TIR-Bench is that it explicitly investigates how models interact with images,
- TIR-Bench is designed to evaluate models’ tool-based visual reasoning ability, but many evaluated models (e.g., LLaVA, Qwen2.5-VL, InternVL) cannot execute code or invoke tools. Their inclusion mainly serves as a static baseline rather than a true test of “thinking-with-images.” - The 1215 examples are divided across 13 tasks, so several tasks have modest sample sizes, which raises concerns about the benchmark’s robustness at this data scale. - Using GPT-4o to extract final answers can introduce pars
- The paper addresses a critical need for new benchmarks in this area. - It identifies more difficult tasks than previous datasets suitable for new code-execution based approaches. - It highlights issues with current state-of-the-art models. - Table 2 presents a wide selection of model evaluations. - Many qualitative examples from the benchmark are shown. - Preliminary experiments on SFT are presented.
- The data seems to need more verification; annotation by a single student, without at least one other person checking, may not be reliable. - Synthetic and hand-annotated data are mixed here, even though they seem to evaluate markedly different capabilities. Perhaps they should be switched? - The provenance of the data is unclear, posing potential issues for usage. - The figures showing model responses with interleaved thinking and code, such as Figures 4, 5, 24, and 25, are difficult to parse.
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Language, Metaphor, and Cognition
