MiMo-VL Technical Report

Xiaomi LLM-Core Team: Zihao Yue; Zhenru Lin; Yifan Song; Weikun Wang; Shuhuai Ren; Shuhao Gu; Shicheng Li; Peidian Li; Liang Zhao; Lei Li; Kainan Bao; Hao Tian; Hailin Zhang; Gang Wang; Dawei Zhu; Cici; Chenhong He; Bowen Ye; Bowen Shen; Zihan Zhang; Zihan Jiang; Zhixian Zheng; Zhichao Song; Zhenbo Luo; Yue Yu; Yudong Wang; Yuanyuan Tian; Yu Tu; Yihan Yan; Yi Huang; Xu Wang; Xinzhe Xu; Xingchen Song; Xing Zhang; Xing Yong; Xin Zhang; Xiangwei Deng; Wenyu Yang; Wenhan Ma; Weiwei Lv; Weiji Zhuang; Wei Liu; Sirui Deng; Shuo Liu; Shimao Chen; Shihua Yu; Shaohui Liu; Shande Wang; Rui Ma; Qiantong Wang; Peng Wang; Nuo Chen; Menghang Zhu; Kangyang Zhou; Kang Zhou; Kai Fang; Jun Shi; Jinhao Dong; Jiebao Xiao; Jiaming Xu; Huaqiu Liu; Hongshen Xu; Heng Qu; Haochen Zhao; Hanglong Lv; Guoan Wang; Duo Zhang; Dong Zhang; Di Zhang; Chong Ma; Chang Liu; Can Cai; Bingquan Xia

arXiv:2506.03569·cs.CL·June 5, 2025

Open Access · 1 Repo · 6 Models · 1 Dataset

TL;DR

MiMo-VL-7B models achieve state-of-the-art performance across diverse vision-language tasks through four-stage pre-training and mixed on-policy reinforcement learning, with leading results in both general visual understanding and multimodal reasoning.

Contribution

Introduces the open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL models, trained with four-stage pre-training followed by Mixed On-policy Reinforcement Learning (MORL), and contributes a comprehensive evaluation suite for reproducibility.

Findings

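01 MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 of 40 evaluated tasks.

02 MiMo-VL-7B-RL scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters.

03 MiMo-VL-7B-RL sets a new standard in GUI grounding with 56.1 on OSWorld-G, outperforming specialized models such as UI-TARS.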

Abstract

We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite…

Code & Models

Repositories

xiaomimimo/mimo-vl
PyTorch · Official

Datasets

Septzzz/MMR-Life
dataset · 199 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics