Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, Zongqing Lu

TL;DR
Being-M0.5 is a real-time, controllable vision-language-motion model that advances human motion generation by enabling fine-grained, diverse, and long-term sequence control, supported by a large-scale dataset and novel motion tokenization.
Contribution
The paper introduces Being-M0.5, the first real-time, controllable VLMM, together with a new part-aware motion tokenization technique and a comprehensive large-scale dataset, HuMo100M.
Findings
Achieves state-of-the-art performance on multiple motion benchmarks.
Demonstrates real-time generation capabilities.
Provides detailed analysis and insights for future motion generation development.
Abstract
Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initialization capabilities, poor performance on long-term sequences, insufficient handling of unseen scenarios, and lack of fine-grained control over individual body parts. To overcome these limitations, we present Being-M0.5, the first real-time, controllable VLMM that achieves state-of-the-art performance across multiple motion generation tasks. Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date, comprising over 5 million self-collected motion sequences,…
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
