HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation

Yuxin Wen; Qing Shuai; Di Kang; Jing Li; Cheng Wen; Yue Qian; Ningxin Jiao; Changhai Chen; Weijie Chen; Yiran Wang; Jinkun Guo; Dongyue An; Han Liu; Yanyu Tong; Chao Zhang; Qing Guo; Juan Chen; Qiao Zhang; Youyi Zhang; Zihao Yao; Cheng Zhang; Hong Duan; Xiaoping Wu; Qi Chen; Fei Cheng; Liang Dong; Peng He; Hao Zhang; Jiaxin Lin; Chao Zhang; Zhongyi Fan; Yifan Li; Zhichao Hu; Yuhong Liu; Linus; Jie Jiang; Xiaolong Li; Linchao Bao

arXiv:2512.23464·cs.CV·December 30, 2025

Open Access · 2 Models

TL;DR

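HY-Motion 1.0 is a series of large-scale Diffusion Transformer (DiT)-based flow matching models that generate 3D human motions from text. The authors report instruction-following capabilities that significantly outperform current open-source benchmarks, which they attribute to a full-stage training paradigm (pretraining, fine-tuning, reinforcement learning) and a dedicated data processing pipeline.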

Contribution

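Presented as the first successful attempt to scale Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale for motion generation, together with a full-stage training pipeline (large-scale pretraining, high-quality fine-tuning, and reinforcement learning) and broad motion coverage.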

Findings

01

Instruction-following capability significantly outperforms current open-source benchmarks.

02

Covers over 200 motion categories across 6 major classes.

03

Generates high-quality, diverse 3D human motions from textual descriptions.

Abstract

We present HY-Motion 1.0, a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. HY-Motion 1.0 represents the first successful attempt to scale up Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale within the motion generation domain, delivering instruction-following capabilities that significantly outperform current open-source benchmarks. Uniquely, we introduce a comprehensive, full-stage training paradigm -- including large-scale pretraining on over 3,000 hours of motion data, high-quality fine-tuning on 400 hours of curated data, and reinforcement learning from both human feedback and reward models -- to ensure precise alignment with the text instruction and high motion quality. This framework is supported by our meticulous data processing pipeline, which performs…
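
For readers unfamiliar with flow matching, the following is a minimal sketch of a generic linear-path (rectified-flow style) training objective and an Euler sampler. It is not taken from the paper: the tensor layout (batch, frames, features), the function names, and the omission of text conditioning are all assumptions made for illustration, and the authors' actual formulation may differ.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Build a point on the straight path from noise x0 to data x1.

    Assumed layout (hypothetical): x0, x1 have shape (batch, frames, features),
    t has shape (batch,) with values in [0, 1].
    """
    # Broadcast t over all non-batch axes.
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1   # interpolated sample at time t
    v_target = x1 - x0              # constant velocity along the straight path
    return x_t, v_target

def flow_matching_loss(velocity_fn, x1, rng):
    """Mean squared error between predicted and target velocity.

    velocity_fn(x_t, t) stands in for the network; a real text-to-motion
    model would also receive a text embedding (omitted here).
    """
    x0 = rng.standard_normal(x1.shape)            # Gaussian noise sample
    t = rng.uniform(0.0, 1.0, size=x1.shape[0])   # one time value per example
    x_t, v_target = flow_matching_pair(x0, x1, t)
    v_pred = velocity_fn(x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_fn, shape, rng, steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full(shape[0], i * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

In this sketch, training would minimize `flow_matching_loss` over batches of motion clips, and generation would call `euler_sample` with the trained velocity function. The straight-line path is only one common choice of probability path.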

Taxonomy

Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis