Thyme: Think Beyond Images
Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, Guorui Zhou

TL;DR
Thyme introduces a novel paradigm for multimodal large language models that autonomously generate and execute diverse image processing and computational operations, significantly improving performance on perception and reasoning benchmarks.
Contribution
It presents a new approach enabling models to perform rich image manipulations and computations through code, with a two-stage training strategy and a novel RL algorithm, GRPO-ATS.
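As a rough illustration of the RL side, the sketch below shows only the group-relative advantage normalization used in standard GRPO, where each sampled response is scored against the other responses to the same prompt. It does not reproduce the ATS modification, which this summary does not describe. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    real implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled responses (1.0 = correct answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group mean receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.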
Findings
Significant performance improvements on nearly 20 benchmarks.
Enhanced capabilities in high-resolution perception tasks.
Improved complex reasoning performance.
Abstract
Following OpenAI's introduction of the "thinking with images" concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation,…
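The idea of a model emitting code that is then executed on the image can be sketched in a few lines. This is a toy under stated assumptions, not Thyme's sandbox: the image is a nested list of pixel values, `crop`, `rotate90`, and `run_snippet` are hypothetical helpers, and `exec` with an emptied `__builtins__` is not a real security boundary.

```python
def crop(image, top, left, height, width):
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def rotate90(image):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]

def run_snippet(snippet, image):
    """Execute model-written code and return (result, error_message).

    Assumption: the snippet stores its output in a variable named `result`.
    Emptying __builtins__ only limits accidental misuse; it is not a sandbox.
    """
    namespace = {"__builtins__": {}, "image": image,
                 "crop": crop, "rotate90": rotate90}
    try:
        exec(snippet, namespace)
    except Exception as exc:
        # Report the failure instead of crashing, so the caller can retry.
        return None, f"{type(exc).__name__}: {exc}"
    return namespace.get("result"), None

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# A snippet such as a model might emit: crop the top-right 2x2 patch, rotate it.
snippet = "result = rotate90(crop(image, 0, 1, 2, 2))"
result, error = run_snippet(snippet, image)   # result == [[5, 2], [6, 3]]

# A faulty snippet comes back as an error string rather than an exception.
bad_result, bad_error = run_snippet("result = undefined_fn(image)", image)
```

Returning the error as data rather than raising it lets the calling loop feed the message back to the model for another attempt.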
Peer Reviews
Decision·ICLR 2026 Poster
1. The training dataset is large (500K + 10K) and includes diverse editing tools. 2. The SFT data and RL data are separate, which is very clear. 3. The experiments are comprehensive and include many benchmarks. The signal is clear: adding this data is helpful.
The main concerns are missing baselines in the experiments and missing citations. Experiments results in Table need more baseline models. Could you also add GPT 4 and 5, GPT O3, Gemini 2.5 Pro, Claude, [Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models], [Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning], etc., as baselines? Missing Citations on related works (because many concepts in Thyme first appeared in the
- Experiments are well-executed, thorough, and explore a range of design decisions. - Dataset contribution would be substantial and fill a gap in the open-source community. - Problem is well-motivated and timely. - Empirical benefits are convincing on a range of benchmarks.
- Fully missing discussion of a substantial line of related work in code execution for visual reasoning: Visual Programming - Gupta et al and ViperGPT - Suris et al first introduced the idea of code execution at inference for visual reasoning, prior to OpenAI's 'thinking with images'; subsequent work such as Visual Sketchpad by Hu et al extended this to a larger set of image manipulations (including on some benchmarks reported here, albeit relying on proprietary models) - If I understand correct
1. Comprehensive and well-engineered system: The paper presents a complete pipeline integrating multimodal understanding, code generation, and sandbox execution, with careful attention to practical details like error handling and security constraints. 2. Rich functionality beyond existing work: Thyme supports diverse image manipulations (cropping, rotation, contrast enhancement) and mathematical computations, offering genuine multimodal reasoning capabilities. 3. Well-designed data pipeline wi
1. The provided ablation studies (Tables 5, 6, and 7) suggest that the performance gains from several of these carefully designed components are limited. 2. The paper does not say whether the curated dataset and the specialized sandbox environment, which are claimed as key components, will be open-sourced. Whether they are open-sourced is an important basis for judging the contribution of this paper.
