MiMo-VL Technical Report
Xiaomi LLM-Core Team: Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng

TL;DR
MiMo-VL-7B models achieve state-of-the-art performance across diverse vision-language tasks through extensive pre-training and mixed reinforcement learning, setting new benchmarks in visual understanding and multimodal reasoning.
Contribution
Introduces the MiMo-VL-7B models, trained with multi-stage pre-training followed by mixed RL, and contributes a comprehensive evaluation suite for reproducibility.
Findings
MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35/40 tasks.
MiMo-VL-7B-RL scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters.
MiMo-VL-7B-RL sets a new standard in GUI grounding with 56.1 on OSWorld-G, outperforming specialized models such as UI-TARS.
Abstract
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite…
Code & Models
- 🤗 XiaomiMiMo/MiMo-VL-7B-RL-2508 · model · 104k dl · ♡ 91
- 🤗 XiaomiMiMo/MiMo-VL-7B-RL · model · 2.0k dl · ♡ 169
- 🤗 XiaomiMiMo/MiMo-VL-7B-SFT · model · 1.1k dl · ♡ 55
- 🤗 XiaomiMiMo/MiMo-VL-7B-RL-GGUF · model · 413 dl · ♡ 7
- 🤗 XiaomiMiMo/MiMo-VL-7B-SFT-GGUF · model · 74 dl · ♡ 4
- 🤗 XiaomiMiMo/MiMo-VL-7B-SFT-2508 · model · 2.2k dl · ♡ 36
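The entries above can be located programmatically. The following is a minimal sketch, assuming the 🤗 marker means these are Hugging Face Hub model repositories. The helper names `hub_url` and `fetch_checkpoint` are hypothetical and do not come from the paper. `snapshot_download` is the `huggingface_hub` function for fetching a whole repository; it is imported lazily so the rest of the sketch runs without that package or network access.

```python
# Repository IDs copied from the Code & Models list.
MIMO_VL_REPOS = [
    "XiaomiMiMo/MiMo-VL-7B-RL-2508",
    "XiaomiMiMo/MiMo-VL-7B-RL",
    "XiaomiMiMo/MiMo-VL-7B-SFT",
    "XiaomiMiMo/MiMo-VL-7B-RL-GGUF",
    "XiaomiMiMo/MiMo-VL-7B-SFT-GGUF",
    "XiaomiMiMo/MiMo-VL-7B-SFT-2508",
]


def hub_url(repo_id: str) -> str:
    """Return the Hugging Face Hub page URL for a model repository (hypothetical helper)."""
    return f"https://huggingface.co/{repo_id}"


def fetch_checkpoint(repo_id: str) -> str:
    """Download all files of a repository and return the local cache path.

    Hypothetical helper; requires `pip install huggingface_hub` and network access.
    """
    from huggingface_hub import snapshot_download  # lazy import, only needed here

    return snapshot_download(repo_id=repo_id)


if __name__ == "__main__":
    # Print the Hub page for each listed repository; no download is triggered.
    for repo in MIMO_VL_REPOS:
        print(hub_url(repo))
```

Calling `fetch_checkpoint("XiaomiMiMo/MiMo-VL-7B-RL")` would download the full checkpoint, which is several gigabytes for a 7B model, so the sketch only prints URLs by default.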
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics
