PRISM: Streaming Human Motion Generation with Per-Joint Latent Decomposition
Zeyu Ling, Qing Shuai, Teng Zhang, Shiyang Li, Bo Han, Changqing Zou

TL;DR
PRISM introduces a structured joint-factorized latent space and noise-free condition injection, enabling high-quality, versatile, and streaming human motion generation from text and pose inputs, with improved long-horizon synthesis.
Contribution
It proposes a novel joint-factorized latent space and a noise-free conditioning method, unifying multiple motion generation tasks in a single model.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Enables seamless text-to-motion and pose-conditioned generation.
Supports autoregressive streaming synthesis with reduced drift.
Abstract
Text-to-motion generation has advanced rapidly, yet two challenges persist. First, existing motion autoencoders compress each frame into a single monolithic latent vector, entangling trajectory and per-joint rotations in an unstructured representation that downstream generators struggle to model faithfully. Second, text-to-motion, pose-conditioned generation, and long-horizon sequential synthesis typically require separate models or task-specific mechanisms, with autoregressive approaches suffering from severe error accumulation over extended rollouts. We present PRISM, addressing each challenge with a dedicated contribution. (1) A joint-factorized motion latent space: each body joint occupies its own token, forming a structured 2D grid (time joints) compressed by a causal VAE with forward-kinematics supervision. This simple change to the latent space -- without modifying the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHuman Motion and Animation · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
