OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation

Guowei Xu; Yuxuan Bian; Ailing Zeng; Mingyi Shi; Shaoli Huang; Wen Li; Lixin Duan; Qiang Xu

arXiv:2510.19789·cs.CV·October 23, 2025

Open Access

TL;DR

OmniMotion-X is a multimodal framework that generates controllable whole-body human motion from text, music, speech, and spatial-temporal control signals. It relies on reference-motion conditioning, a progressive weak-to-strong mixed-condition training strategy, and the OmniMoCap-X dataset.

Contribution

The paper introduces OmniMotion-X, a novel autoregressive diffusion transformer for multimodal motion generation, and presents OmniMoCap-X, the largest unified multimodal motion dataset with hierarchical annotations.

Findings

01. Outperforms existing methods across multiple tasks

02. Supports diverse multimodal inputs and control scenarios

03. Produces realistic, coherent, and long-duration motions

Abstract

This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X,…
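The spatial-temporal control scenarios named in the abstract are commonly expressed as a binary mask over frames and joints that marks which parts of a motion are given as constraints. The sketch below illustrates that general idea only; it is not taken from the paper. The function name `control_mask`, the task labels, and the `context` and `joints` parameters are all hypothetical.

```python
import numpy as np


def control_mask(task, num_frames, num_joints, context=4, joints=None):
    """Return a boolean (num_frames, num_joints) mask.

    True marks entries treated as given constraints; False marks entries
    a generator would have to synthesize. Task names are hypothetical.
    """
    if not 0 < context <= num_frames:
        raise ValueError("context must be in 1..num_frames")

    mask = np.zeros((num_frames, num_joints), dtype=bool)

    if task == "prediction":
        # Motion prediction: the first `context` frames are observed.
        mask[:context] = True
    elif task == "in_betweening":
        # In-betweening: the first and last `context` frames are observed.
        if 2 * context > num_frames:
            raise ValueError("context windows overlap")
        mask[:context] = True
        mask[-context:] = True
    elif task == "joint_guided":
        # Joint/trajectory guidance: selected joints are given on every frame.
        if not joints:
            raise ValueError("joint_guided needs a non-empty joints list")
        mask[:, list(joints)] = True
    else:
        raise ValueError(f"unknown task: {task}")

    return mask
```

Under this assumption, combining tasks amounts to taking the logical OR of their masks, for example `control_mask("prediction", 60, 22) | control_mask("joint_guided", 60, 22, joints=[0])`.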

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications