Xiaomi MiMo-VL-Miloco Technical Report

Jiaze Li; Jingyang Chen; Yuxun Qu; Shijie Xu; Zhenru Lin; Junyou Zhu; Boshen Xu; Wenhui Tan; Pei Fu; Jianzhong Ju; Zhenbo Luo; Jian Luan

arXiv:2512.17436 · cs.CV · December 23, 2025

Open Access

TL;DR

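This paper introduces MiMo-VL-Miloco, a vision-language model specialized for smart-home environments. It reports strong performance on home-scenario understanding and general multimodal reasoning, is trained with a two-stage pipeline that combines supervised fine-tuning and reinforcement learning, and is released as open-source weights (MiMo-VL-Miloco-7B and a quantized GGUF variant).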

Contribution

The paper presents a specialized vision-language model for smart homes, a two-stage training method combining supervised fine-tuning and reinforcement learning, and open-source tools for real-world deployment.

Findings

01 Achieves leading F1 scores in gesture recognition and home-scenario understanding.

02 Outperforms baselines on multimodal reasoning benchmarks.

03 Enhances text reasoning through targeted home-scenario training.

Abstract

We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning…
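The abstract reports F1 scores for gesture recognition without describing, in the portion shown here, how they are aggregated. As a reminder of what the metric measures, the following is a minimal sketch of per-class and macro-averaged F1 computed from predicted and true labels. The gesture label names and the choice of macro averaging are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} from parallel lists of true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but was wrong
            fn[t] += 1          # t was the truth but was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed aggregation, for illustration)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical gesture labels, made up for this example only.
y_true = ["thumbs_up", "ok", "palm", "ok", "thumbs_up", "palm"]
y_pred = ["thumbs_up", "ok", "ok", "ok", "thumbs_up", "palm"]

per_class = f1_per_class(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
print(per_class, overall)
```

In this toy example one "palm" sample is misclassified as "ok", which lowers precision for "ok" and recall for "palm" while leaving "thumbs_up" unaffected. Macro averaging weights every gesture class equally regardless of how often it occurs.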

Taxonomy

Topics: Multimodal Machine Learning Applications · Hand Gesture Recognition Systems · Speech and dialogue systems