Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model

Bin Cao; Sipeng Zheng; Ye Wang; Lujie Xia; Qianshan Wei; Qin Jin; Jing Liu; Zongqing Lu

arXiv:2508.07863 · cs.CV · August 12, 2025

Open Access

TL;DR

Being-M0.5 is a real-time, controllable vision-language-motion model that advances human motion generation by enabling fine-grained, diverse, and long-term sequence control, supported by a large-scale dataset and novel motion tokenization.

Contribution

The paper introduces Being-M0.5, which the authors describe as the first real-time, controllable vision-language-motion model (VLMM), together with a new part-aware motion tokenization technique and HuMo100M, a comprehensive large-scale human motion dataset.

Findings

01. Achieves state-of-the-art performance on multiple motion benchmarks.

02. Demonstrates real-time generation capabilities.

03. Provides detailed analysis and insights for future motion generation development.

Abstract

Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initialization capabilities, poor performance on long-term sequences, insufficient handling of unseen scenarios, and lack of fine-grained control over individual body parts. To overcome these limitations, we present Being-M0.5, the first real-time, controllable VLMM that achieves state-of-the-art performance across multiple motion generation tasks. Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date, comprising over 5 million self-collected motion sequences,…

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis