PackDiT: Joint Human Motion and Text Generation via Mutual Prompting
Zhongyu Jiang, Wenhao Chai, Zhuoran Zhou, Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang

TL;DR
PackDiT is a novel diffusion-based model that jointly generates human motion and text, enabling bidirectional tasks like motion-to-text and text-to-motion with state-of-the-art performance.
Contribution
The paper introduces what the authors describe as the first diffusion model that handles motion generation, text generation, and joint motion-text generation within a single framework, using mutual blocks to connect diffusion transformers (DiTs) across modalities.
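This summary does not say how a mutual block works internally. As a rough illustration only, the sketch below shows one hypothetical way two token streams could exchange information: the motion and text tokens are concatenated, a single self-attention pass runs over the combined sequence, and the result is split back per modality. The function name `joint_attention`, the weight matrices, and the residual update are all assumptions made for illustration, not the authors' design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(motion_tokens, text_tokens, w_q, w_k, w_v):
    """Hypothetical cross-modal mixing step (an assumption, not PackDiT's block).

    motion_tokens: (T_motion, d) array, text_tokens: (T_text, d) array.
    w_q, w_k, w_v: (d, d) projection matrices shared by both streams.
    """
    # Concatenate both modalities into one sequence.
    packed = np.concatenate([motion_tokens, text_tokens], axis=0)
    q, k, v = packed @ w_q, packed @ w_k, packed @ w_v
    # Every token attends to every token of both modalities.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mixed = packed + attn @ v  # residual update
    n_motion = motion_tokens.shape[0]
    # Split back into per-modality streams.
    return mixed[:n_motion], mixed[n_motion:]

rng = np.random.default_rng(0)
d = 8
motion = rng.normal(size=(6, d))  # stand-in for motion tokens
text = rng.normal(size=(4, d))    # stand-in for text tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
motion_out, text_out = joint_attention(motion, text, w_q, w_k, w_v)
```

Each stream keeps its own length and width, so a step like this could sit between otherwise separate per-modality transformer layers.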
Findings
Achieved state-of-the-art text-to-motion FID score of 0.106.
Demonstrated effective bidirectional motion-text generation.
Outperformed previous models in motion prediction and in-between tasks.
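For context on the FID figure above: the Fréchet distance compares the mean and covariance of two sets of feature vectors, and lower is better. The sketch below computes it with NumPy. The feature extractor used in the paper is not described in this summary, so the arrays here are random stand-ins and the function name `frechet_distance` is chosen for illustration.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (n, d) feature arrays.

    FID = ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^(1/2))
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # The trace of the matrix square root equals the sum of the square
    # roots of the eigenvalues of cov_a @ cov_b, which are real and
    # non-negative for covariance matrices (up to rounding error).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))  # stand-in for features of real motions
same = frechet_distance(real_feats, real_feats)
shifted = frechet_distance(real_feats, real_feats + 3.0)
```

Identical feature sets give a distance of approximately zero, and shifting every feature by a constant changes only the mean term.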
Abstract
Human motion generation has advanced markedly with the advent of diffusion models. Most recent studies have concentrated on generating motion sequences based on text prompts, commonly referred to as text-to-motion generation. However, the bidirectional generation of motion and text, enabling tasks such as motion-to-text alongside text-to-motion, has been largely unexplored. This capability is essential for aligning diverse modalities and supports unconditional generation. In this paper, we introduce PackDiT, the first diffusion-based generative model capable of performing various tasks simultaneously, including motion generation, motion prediction, text generation, text-to-motion, motion-to-text, and joint motion-text generation. Our core innovation leverages mutual blocks to integrate multiple diffusion transformers (DiTs) across different modalities seamlessly. We train PackDiT on the…
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Hand Gesture Recognition Systems
Methods: Diffusion
