MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Junbo Cui; Bokai Xu; Chongyi Wang; Tianyu Yu; Weiyue Sun; Yingjing Xu; Tianran Wang; Zhihui He; Wenshuo Ma; Tianchi Cai; Jiancheng Gui; Luoyuan Zhang; Xian Sun; Fuwei Huang; Moye Chen; Zhuo Lin; Hanyu Liu; Qingxin Gui; Qingzhe Han; Yuyang Wen; Huiping Liu; Rongkang Wang; Yaqi Zhang; Hongliang Wei; Chi Chen; You Li; Kechen Fang; Jie Zhou; Yuxuan Li; Guoyang Zeng; Chaojun Xiao; Yankai Lin; Xu Han; Maosong Sun; Zhiyuan Liu; Yuan Yao

arXiv:2604.27393 · cs.CL · May 1, 2026

TL;DR

MiniCPM-o 4.5 introduces a real-time, full-duplex multimodal interaction framework in which the model perceives and responds simultaneously and can act proactively. With only 9B parameters, it outperforms existing models on vision-language and speech tasks.

Contribution

The paper presents Omni-Flow, a unified streaming framework for full-duplex multimodal interaction, enabling simultaneous perception and response in a single, time-aligned process.
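To make the contrast between alternating perception/response phases and a single time-aligned process concrete, here is a toy sketch. It is not Omni-Flow and not the MiniCPM-o API: the fixed time "tick", the event names "question" and "stop", and the canned RESPONSE are all hypothetical. It only shows why a loop that perceives input on every tick, even while speaking, can adjust its output mid-generation, while an alternating loop cannot.

```python
from typing import List, Optional

# Hypothetical canned reply, emitted one token per time tick.
RESPONSE = ["The", "weather", "today", "is", "sunny"]


def half_duplex(events: List[Optional[str]]) -> List[Optional[str]]:
    """Alternating phases: once speaking starts, new input events are ignored."""
    timeline: List[Optional[str]] = []
    pending: List[str] = []
    for event in events:
        if pending:
            # Response phase: this tick's event is never perceived.
            timeline.append(pending.pop(0))
            continue
        # Perception phase: listen and schedule the reply for later ticks.
        if event == "question":
            pending = list(RESPONSE)
        timeline.append(None)
    return timeline


def full_duplex(events: List[Optional[str]]) -> List[Optional[str]]:
    """Time-aligned loop: every tick perceives input, then decides output."""
    timeline: List[Optional[str]] = []
    speaking, pos = False, 0
    for event in events:
        # 1) Perceive this tick's input, even while speaking.
        if event == "question":
            speaking, pos = True, 0
        elif event == "stop":
            speaking = False  # an interruption takes effect immediately
        # 2) Emit at most one output token for the same tick.
        if speaking and pos < len(RESPONSE):
            timeline.append(RESPONSE[pos])
            pos += 1
        else:
            speaking = False
            timeline.append(None)
    return timeline


events = ["question", None, "stop", None, None, None]
print(half_duplex(events))  # [None, 'The', 'weather', 'today', 'is', 'sunny']
print(full_duplex(events))  # ['The', 'weather', None, None, None, None]
```

With the same event stream, the alternating loop keeps talking after the user says "stop" because input is not perceived during its response phase, while the time-aligned loop stops on the tick the interruption arrives. Proactive behavior would correspond to the model choosing to speak on a tick with no explicit request, which this toy policy does not attempt.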

Findings

01 MiniCPM-o 4.5 achieves state-of-the-art vision-language performance at 9B parameters.

02 It surpasses Qwen3-Omni-30B-A3B in omni-modal understanding.

03 It runs real-time interaction on edge devices with less than 12 GB of RAM.

Abstract

Recent progress in multimodal large language models (MLLMs) has moved AI from static, offline data processing to real-time streaming interaction, yet these models remain far from human-level multimodal interaction. The key bottleneck is no longer modality coverage or latency alone, but the interaction paradigm itself. First, perception and response are still split into alternating phases, which prevents models from incorporating new inputs to adjust their output during generation. Second, most current models remain reactive, responding only to explicit user requests instead of acting proactively in an evolving multimodal environment. We present MiniCPM-o 4.5, our latest effort towards human-like multimodal interaction, which narrows these gaps through real-time full-duplex omni-modal interaction. It can see, listen, and speak simultaneously in real time, while also exhibiting…

Code & Models

Repositories

openbmb/MiniCPM-o (GitHub)