Xiaomi MiMo-VL-Miloco Technical Report
Jiaze Li, Jingyang Chen, Yuxun Qu, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu, Jianzhong Ju, Zhenbo Luo, Jian Luan

TL;DR
This paper introduces MiMo-VL-Miloco, a vision-language model specialized for smart-home environments. It reports leading F1 scores on gesture recognition and home-scenario understanding, gains on multimodal reasoning benchmarks, a two-stage training pipeline, and open-source model releases.
Contribution
The paper presents a specialized vision-language model for smart homes, a two-stage training method combining supervised fine-tuning and reinforcement learning, and open-source tools for real-world deployment.
Findings
Achieves leading F1 scores in gesture recognition and home-scenario understanding.
Outperforms baselines on multimodal reasoning benchmarks.
Enhances text reasoning through targeted home-scenario training.
Abstract
We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning…
Taxonomy
Topics: Multimodal Machine Learning Applications · Hand Gesture Recognition Systems · Speech and dialogue systems
