Simple o3: Towards Interleaved Vision-Language Reasoning

Ye Wang; Qianglong Chen; Zejun Li; Siyuan Wang; Shijie Guo; Zhirui Zhang; Zhongyu Wei

arXiv:2508.12109·cs.CV·August 19, 2025

Open Access

TL;DR

Simple o3 introduces an end-to-end multimodal reasoning framework that integrates dynamic visual operations with linguistic reasoning, significantly improving vision-language task performance through a scalable data synthesis pipeline and interleaved reasoning strategies.

Contribution

The paper presents Simple o3, a novel approach combining visual transformations and linguistic reasoning with a new data synthesis pipeline and analysis of interleaved reasoning strategies.

Findings

1. Enhanced reasoning with additional visual tokens improves performance.

2. Reusing and magnifying images boosts visual reasoning and perception.

3. Cropping images based on visual grounding enhances focus on key regions.
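
The crop and magnify operations named in the findings can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `crop`, `magnify`, and `crop_and_zoom`, the list-of-lists image representation, the `(left, top, right, bottom)` box convention, and nearest-neighbour upscaling are all assumptions chosen to keep the example dependency-free.

```python
from typing import List, Tuple

Image = List[List[int]]          # hypothetical: grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # hypothetical: (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image inside `box`, clamped to the image bounds."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("box does not overlap the image")
    return [row[left:right] for row in image[top:bottom]]


def magnify(image: Image, factor: int) -> Image:
    """Upscale by an integer factor by repeating each pixel (nearest neighbour)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out: Image = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        # Repeat the widened row `factor` times, copying so rows stay independent.
        out.extend(list(wide) for _ in range(factor))
    return out


def crop_and_zoom(image: Image, box: Box, factor: int = 2) -> Image:
    """Focus on a grounded region: crop it, then enlarge it."""
    return magnify(crop(image, box), factor)
```

In an interleaved reasoning loop, the output of something like `crop_and_zoom` would be fed back to the model as a new image alongside the text produced so far. Passing the full-image box with `factor=1` returns the original pixels unchanged, which loosely corresponds to the "reuse" operation.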

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive performance on vision-language tasks, but their long Chain-of-Thought (CoT) capabilities in multimodal scenarios remain underexplored. Inspired by OpenAI's o3 model, which emulates human-like "thinking with image" through iterative visual transformations and linguistic reasoning, we propose Simple o3, an end-to-end framework that integrates dynamic tool interactions (e.g., cropping, zooming, and reusing) into interleaved vision-language reasoning via supervised fine-tuning (SFT). Our approach features a scalable data synthesis pipeline that generates high-quality interleaved vision-language reasoning chains via an "observe-reason-act" cycle, complete with executable visual operations and rigorous verification, yielding the open-source TWI-Tools-146K dataset. Experimental results demonstrate Simple o3's superior…

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques