STEP3-VL-10B Technical Report
Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen

TL;DR
STEP3-VL-10B is a compact, open-source multimodal foundation model that rivals or surpasses models 10× to 20× larger by combining fully unfrozen pre-training on 1.2T multimodal tokens, reinforcement-learning post-training, and parallel test-time reasoning (PaCoRe).
Contribution
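The paper introduces a training and inference recipe for a 10B-parameter model: unified, fully unfrozen pre-training that pairs a language-aligned Perception Encoder with a Qwen3-8B decoder, post-training with over 1k iterations of reinforcement learning, and Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute.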
Findings
Achieves 92.2% on MMBench and 80.11% on MMMU
Rivals or surpasses models 10× to 20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) on complex reasoning tasks
Provides an open-source, reproducible baseline
Abstract
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×–20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and…
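The abstract describes PaCoRe only at a high level: explore several hypotheses in parallel, then synthesize them into one answer. The sketch below illustrates that general pattern and is not the paper's implementation. The names `propose`, `synthesize`, and `parallel_reason` are hypothetical, the rollout is a deterministic toy instead of a model call, and majority voting is swapped in as a stand-in for whatever synthesis step PaCoRe actually uses.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def propose(question: str, seed: int) -> str:
    """Hypothetical stand-in for one independent model rollout.

    A real system would query the vision-language model here. This toy
    returns "dog" for every third seed and "cat" otherwise.
    """
    return "dog" if seed % 3 == 0 else "cat"


def synthesize(hypotheses: List[str]) -> str:
    """Toy synthesis step: majority vote over the candidate answers.

    Assumption: the real synthesis may be another model pass that reads
    all hypotheses, not a simple vote.
    """
    return Counter(hypotheses).most_common(1)[0][0]


def parallel_reason(
    question: str,
    n_parallel: int,
    propose_fn: Callable[[str, int], str] = propose,
    synthesize_fn: Callable[[List[str]], str] = synthesize,
) -> str:
    """Run n_parallel rollouts concurrently, then combine their answers."""
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        hypotheses = list(
            pool.map(lambda seed: propose_fn(question, seed), range(n_parallel))
        )
    return synthesize_fn(hypotheses)


if __name__ == "__main__":
    # A single rollout uses only seed 0; six rollouts let the majority decide.
    print(parallel_reason("What animal is in the image?", n_parallel=1))
    print(parallel_reason("What animal is in the image?", n_parallel=6))
```

In this toy, raising `n_parallel` from 1 to 6 changes the final answer because the lone first rollout is outvoted, which is the intuition behind spending more test-time compute on parallel exploration.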
Code & Models
- 🤗 stepfun-ai/Step3-VL-10B · model · 217k dl · ♡ 402
- 🤗 seanbailey518/Step3-VL-10B-GGUF · model · 1.9k dl · ♡ 8
- 🤗 stepfun-ai/Step3-VL-10B-FP8 · model · 809 dl · ♡ 10
- 🤗 stepfun-ai/Step3-VL-10B-Base · model · 124 dl · ♡ 49
- 🤗 QuantTrio/Step3-VL-10B-AWQ · model · 13 dl · ♡ 5
- 🤗 cyankiwi/Step3-VL-10B-AWQ-4bit · model · 28 dl
- 🤗 cyankiwi/Step3-VL-10B-AWQ-8bit · model · 4 dl · ♡ 1
- 🤗 np-deploys/Step3-VL-10B-AWQ-4bit · model · 596 dl
- 🤗 Ujjwal-Tyagi/Step3-VL-10B · model · 11 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Reinforcement Learning in Robotics
