Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

Ashkan Taghipour; Morteza Ghahremani; Zinuo Li; Hamid Laga; Farid Boussaid; Mohammed Bennamoun

arXiv:2603.08028 · cs.CV · March 10, 2026

Open Access

TL;DR

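This paper introduces a two-stage framework for generating complex human motion videos from text: an autoregressive model first produces a 2D skeleton sequence, and a pose-conditioned video diffusion model then synthesizes the video from a reference image and that sequence. The cascade avoids both the temporal ambiguity of text-only conditioning and the cost of supplying complete skeleton sequences by hand.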

Contribution

It presents a novel cascaded approach combining text-to-skeleton and pose-conditioned video diffusion, along with a new synthetic dataset for complex human motion video generation.

Findings

01. Outperforms prior methods on FID, R-precision, and motion diversity.

02. Achieves state-of-the-art results on temporal consistency and motion smoothness.

03. Provides a new dataset with 2,000 synthetic videos of complex motions.

Abstract

Generating videos of complex human motions such as flips, cartwheels, and martial arts remains challenging for current video diffusion models. Text-only conditioning is temporally ambiguous for fine-grained motion control, while explicit pose-based controls, though effective, require users to provide complete skeleton sequences that are costly to produce for long and dynamic actions. We propose a two-stage cascaded framework that addresses both limitations. First, an autoregressive text-to-skeleton model generates 2D pose sequences from natural language descriptions by predicting each joint conditioned on previously generated poses. This design captures long-range temporal dependencies and inter-joint coordination required for complex motions. Second, a pose-conditioned video diffusion model synthesizes videos from a reference image and the generated skeleton sequence. It employs…
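
The first stage described in the abstract can be illustrated with a toy sketch of the autoregressive loop. This is not the authors' model: `toy_next_joint` is a hypothetical, hand-written stand-in for the learned predictor, and the 17-joint layout and normalized coordinates are assumptions made only for illustration. What the sketch shows is the control flow the abstract implies, where each joint is produced from the text, the previously generated poses, and the joints already emitted for the current frame.

```python
import math
from typing import List, Tuple

Joint = Tuple[float, float]   # (x, y) in normalized image coordinates
Pose = List[Joint]            # one frame: a list of joints

NUM_JOINTS = 17  # assumption: a COCO-style layout; the skeleton format is not given here


def toy_next_joint(text: str, history: List[Pose], partial: Pose) -> Joint:
    """Hypothetical stand-in for the learned next-joint predictor."""
    j = len(partial)          # index of the joint being predicted
    t = len(history)          # index of the frame being predicted
    seed = sum(ord(c) for c in text) % 97  # crude text conditioning

    # Temporal dependency: start from the same joint in the previous frame.
    if history:
        x, y = history[-1][j]
    else:
        x, y = 0.5, j / (NUM_JOINTS - 1)

    # Inter-joint dependency: pull toward the joint just generated in this frame.
    if partial:
        x = 0.9 * x + 0.1 * partial[-1][0]

    # Small text- and time-dependent displacement, clamped to the image.
    x += 0.01 * math.sin(0.3 * t + 0.1 * seed + j)
    return (min(1.0, max(0.0, x)), y)


def generate_skeleton(text: str, num_frames: int) -> List[Pose]:
    """Generate poses frame by frame, and joints one at a time within a frame."""
    history: List[Pose] = []
    for _ in range(num_frames):
        partial: Pose = []
        for _ in range(NUM_JOINTS):
            partial.append(toy_next_joint(text, history, partial))
        history.append(partial)
    return history


if __name__ == "__main__":
    seq = generate_skeleton("a person does a cartwheel", num_frames=8)
    print(len(seq), len(seq[0]))
```

In the full cascade, the resulting sequence would be passed, together with a reference image, to the pose-conditioned video diffusion model; that second stage is omitted here because the abstract text above is truncated before its details.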

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition