S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

Qingxiao Li; Lifeng Xu; QingLi Wang; Yudong Bai; Mingwei Ou; Shu Hu; Nan Xu

arXiv:2604.21409·cs.CV·April 24, 2026

TL;DR

S1-VL is a multimodal reasoning model for scientific domains that combines structured chain-of-thought reasoning with active image manipulation via code execution, achieving state-of-the-art results on multiple benchmarks.

Contribution

The paper introduces S1-VL, a scientific multimodal reasoning model that supports two reasoning paradigms, Scientific Reasoning and Thinking-with-Images, along with a multi-stage data filtering pipeline for its training data.

Findings

01

S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks.

02

The model outperforms existing systems on scientific reasoning benchmarks such as Physics and VRSBench.

03

The multi-stage filtering pipeline improves the quality of training data by reducing ineffective visual operations.

Abstract

We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We…
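The Thinking-with-Images loop described in the abstract (generate code, execute it, observe the intermediate visual result, continue reasoning) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in and not the authors' implementation: `toy_policy` replaces the model, the image is a nested list of grayscale values rather than a real image, and `run_snippet` uses a plain `exec` call, which is not a real sandbox.

```python
from typing import Callable, Dict, List

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def run_snippet(code: str, image: Image) -> Image:
    """Run model-written code against `image` and return its `result`.

    Assumption: the snippet stores its output in a variable named `result`.
    Plain exec() is NOT a secure sandbox; a real system would isolate this step.
    """
    namespace: Dict[str, object] = {
        "__builtins__": {"len": len, "range": range},  # expose only a few builtins
        "image": image,
    }
    exec(code, namespace)
    return namespace["result"]


def toy_policy(question: str, observations: List[Image]) -> Dict[str, str]:
    """Hypothetical stand-in for the model: crop first, then answer."""
    if not observations:
        # Turn 1: ask for a crop of rows 0-1, columns 2-3.
        return {"type": "code", "code": "result = [row[2:4] for row in image[0:2]]"}
    # Turn 2: reason over the intermediate visual result.
    crop = observations[-1]
    brightest = max(max(row) for row in crop)
    return {"type": "answer", "text": f"max pixel in crop = {brightest}"}


def think_with_images(
    question: str,
    image: Image,
    policy: Callable[[str, List[Image]], Dict[str, str]],
    max_turns: int = 4,
) -> str:
    """Alternate between code execution and reasoning until an answer appears."""
    observations: List[Image] = []
    for _ in range(max_turns):
        step = policy(question, observations)
        if step["type"] == "answer":
            return step["text"]
        observations.append(run_snippet(step["code"], image))
    return "no answer within turn budget"


image = [[0, 1, 2, 3], [4, 5, 9, 7], [8, 8, 8, 8]]
answer = think_with_images("What is the brightest pixel top-right?", image, toy_policy)
print(answer)  # max pixel in crop = 9
```

The turn budget (`max_turns`) is an illustrative choice that keeps the loop bounded; the abstract does not state how the actual system terminates.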
