TL;DR
S1-VL is a multimodal reasoning model for scientific domains that combines structured chain-of-thought reasoning with active image manipulation via code execution, achieving state-of-the-art results on multiple benchmarks.
Contribution
The paper introduces S1-VL, a scientific multimodal reasoning model that supports two reasoning paradigms, Scientific Reasoning and Thinking-with-Images, together with a multi-stage data filtering strategy for training.
Findings
S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks.
The model outperforms existing systems on scientific reasoning benchmarks such as Physics and VRSBench.
The multi-stage filtering pipeline improves the quality of training data by reducing ineffective visual operations.
Abstract
We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We…
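The multi-turn Thinking-with-Images loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The names `Step`, `run_in_sandbox`, `think_with_images`, and `toy_policy` are hypothetical. The image is a nested list standing in for a real image type. The "sandbox" is a plain `exec` with a trimmed namespace, which is an assumption made for brevity and offers no real isolation. The policy is a hand-written stub in place of the model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Grayscale image as rows of pixel values (stand-in for a real image type).
Image = List[List[int]]


@dataclass
class Step:
    """One model turn: either code to run on the image, or a final answer."""
    code: Optional[str] = None
    answer: Optional[str] = None


def run_in_sandbox(code: str, image: Image) -> Image:
    """Run model-written code against `image` and return its `result` variable.

    Assumption: exec with a trimmed namespace only illustrates the idea.
    It is NOT a secure sandbox; a real system would isolate execution.
    """
    namespace = {"__builtins__": {}, "image": image}
    exec(code, namespace)
    return namespace["result"]


def think_with_images(
    policy: Callable[[List[Image]], Step],
    image: Image,
    max_turns: int = 4,
) -> Tuple[Optional[str], List[Image]]:
    """Alternate between code generation and execution until an answer appears."""
    history = [image]  # original image plus every intermediate visual result
    for _ in range(max_turns):
        step = policy(history)
        if step.answer is not None:
            return step.answer, history
        # Execute the generated code on the latest image and feed the result back.
        history.append(run_in_sandbox(step.code, history[-1]))
    return None, history  # turn budget exhausted without an answer


def toy_policy(history: List[Image]) -> Step:
    """Hypothetical stand-in for the model: crop first, then answer."""
    if len(history) == 1:
        # Turn 1: zoom into the top-left 2x2 region.
        return Step(code="result = [row[:2] for row in image[:2]]")
    # Turn 2: read the brightest pixel from the cropped view.
    crop = history[-1]
    return Step(answer=str(max(max(row) for row in crop)))


answer, history = think_with_images(toy_policy, [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Here the stub crops once and then answers from the cropped view, so `history` holds the original image followed by one intermediate result. In a real system the policy would be the model itself, and the executed code would typically call an image-processing library rather than slice lists.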
