Thyme: Think Beyond Images

Yi-Fan Zhang; Xingyu Lu; Shukang Yin; Chaoyou Fu; Wei Chen; Xiao Hu; Bin Wen; Kaiyu Jiang; Changyi Liu; Tianke Zhang; Haonan Fan; Kaibing Chen; Jiankang Chen; Haojie Ding; Kaiyu Tang; Zhang Zhang; Liang Wang; Fan Yang; Tingting Gao; Guorui Zhou

arXiv:2508.11630·cs.CV·August 18, 2025

TL;DR

Thyme introduces a novel paradigm for multimodal large language models that autonomously generate and execute diverse image processing and computational operations, significantly improving performance on perception and reasoning benchmarks.

Contribution

It presents a new approach enabling models to perform rich image manipulations and computations through code, with a two-stage training strategy and a novel RL algorithm, GRPO-ATS.

Findings

1. Significant performance improvements on nearly 20 benchmarks.
2. Enhanced capabilities in high-resolution perception tasks.
3. Improved complex reasoning performance.

Abstract

Following OpenAI's introduction of the "thinking with images" concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation,…
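To make the idea of "image operations via executable code" concrete, here is a minimal sketch of what such a loop could look like: a model emits a short code string, and a host runs it against an image with a small set of allowed helpers. Everything here is an assumption for illustration. The helper names (`crop`, `rotate90`, `run_generated_code`), the convention that generated code stores its answer in `result`, and the nested-list image format are hypothetical and are not taken from the Thyme paper or its sandbox.

```python
# Hypothetical sketch: none of these names come from the Thyme paper.
from typing import Any

Image = list[list[int]]  # grayscale image stored as rows of pixel values


def crop(img: Image, left: int, top: int, right: int, bottom: int) -> Image:
    """Return rows top..bottom-1 and columns left..right-1 of the image."""
    return [row[left:right] for row in img[top:bottom]]


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def run_generated_code(code: str, img: Image) -> dict[str, Any]:
    """Execute model-written code with only a few whitelisted names.

    Returns {"ok": True, "result": ...} on success, or
    {"ok": False, "error": ...} if the code raises or sets no `result`.
    Note: emptying __builtins__ is NOT a real security boundary; an actual
    sandbox would need process-level isolation and resource limits.
    """
    namespace: dict[str, Any] = {
        "__builtins__": {},
        "image": img,
        "crop": crop,
        "rotate90": rotate90,
    }
    try:
        exec(code, namespace)
    except Exception as exc:  # report the failure instead of crashing the host
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    if "result" not in namespace:
        return {"ok": False, "error": "generated code did not set `result`"}
    return {"ok": True, "result": namespace["result"]}


image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
# Stand-in for code a model might generate: crop a region, then rotate it.
generated = "patch = crop(image, 1, 0, 3, 2)\nresult = rotate90(patch)"
outcome = run_generated_code(generated, image)
```

Returning a structured error rather than raising lets the caller feed the failure message back to the model for another attempt, which is one plausible way to handle faulty generated code.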

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The training dataset is large (500K + 10K) and includes diverse editing tools.
2. The SFT data and RL data are kept separate, which is very clear.
3. The experiments are comprehensive and cover many benchmarks. The signal is clear: adding this data is helpful.

Weaknesses

The main concerns are missing baselines and missing citations. The experimental results in Table need more baseline models. Could you also add GPT 4 and 5, GPT O3, Gemini 2.5 Pro, Claude, [Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models], [Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning], etc., as baselines? Missing citations on related work (because many concepts in Thyme first appeared in the

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- Experiments are well-executed, thorough, and explore a range of design decisions.
- Dataset contribution would be substantial and fill a gap in the open-source community.
- Problem is well-motivated and timely.
- Empirical benefits are convincing on a range of benchmarks.

Weaknesses

- A substantial line of related work on code execution for visual reasoning is not discussed at all: Visual Programming (Gupta et al.) and ViperGPT (Suris et al.) first introduced the idea of code execution at inference for visual reasoning, prior to OpenAI's 'thinking with images'; subsequent work such as Visual Sketchpad by Hu et al. extended this to a larger set of image manipulations (including on some benchmarks reported here, albeit relying on proprietary models).
- If I understand correct

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Comprehensive and well-engineered system: The paper presents a complete pipeline integrating multimodal understanding, code generation, and sandbox execution, with careful attention to practical details like error handling and security constraints.
2. Rich functionality beyond existing work: Thyme supports diverse image manipulations (cropping, rotation, contrast enhancement) and mathematical computations, offering genuine multimodal reasoning capabilities.
3. Well-designed data pipeline wi

Weaknesses

1. The provided ablation studies (Tables 5, 6, and 7) suggest that the performance gains from several of these carefully designed components are limited.
2. The paper does not state whether the curated dataset and the specialized sandbox environment, both claimed as key components, will be open-sourced. Whether they are released is an important basis for judging the contribution of this paper.
