Fleximo: Towards Flexible Text-to-Human Motion Video Generation
Yuhang Zhang, Yuan Zhou, Zeyu Liu, Yuxuan Cai, Qiuyue Wang, Aidong Men, Huan Yang

TL;DR
Fleximo introduces a flexible, text-driven approach for generating human motion videos from images and language, overcoming pose detection limitations and requiring minimal reference data.
Contribution
The paper presents Fleximo, a novel framework that leverages large-scale pre-trained text-to-3D motion models with new rescaling, skeleton adaptation, and refinement techniques for improved video generation.
Findings
Outperforms existing image-to-video methods in quality and accuracy
Introduces MotionBench benchmark with 400 videos and 20 motions
Proposes MotionScore metric for motion accuracy evaluation
Abstract
Current methods for generating human motion videos rely on extracting pose sequences from reference videos, which restricts flexibility and control. Additionally, due to the limitations of pose detection techniques, the extracted pose sequences can sometimes be inaccurate, leading to low-quality video outputs. We introduce a novel task aimed at generating human motion videos solely from reference images and natural language. This approach offers greater flexibility and ease of use, as text is more accessible than the desired guidance videos. However, training an end-to-end model for this task requires millions of high-quality text and human motion video pairs, which are challenging to obtain. To address this, we propose a new framework called Fleximo, which leverages large-scale pre-trained text-to-3D motion models. This approach is not straightforward, as the text-generated skeletons…
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Adapter
