PackDiT: Joint Human Motion and Text Generation via Mutual Prompting

Zhongyu Jiang; Wenhao Chai; Zhuoran Zhou; Cheng-Yen Yang; Hsiang-Wei Huang; Jenq-Neng Hwang

arXiv:2501.16551·cs.CV·January 29, 2025

Open Access

TL;DR

PackDiT is a novel diffusion-based model that jointly generates human motion and text, enabling bidirectional tasks like motion-to-text and text-to-motion with state-of-the-art performance.

Contribution

The paper introduces the first diffusion model capable of simultaneous multi-modal human motion and text generation, utilizing mutual blocks for seamless integration across modalities.
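
As a rough illustration of what integrating two modality streams can look like, the sketch below runs one attention pass over the concatenated motion and text tokens, so every token can attend to both modalities. This is an assumption made for illustration only: the summary here does not specify how PackDiT's mutual blocks are built, and the names `joint_attention` and `softmax`, along with the toy token values, are hypothetical. Learned projections, multiple heads, and diffusion timestep conditioning are all omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(motion_tokens, text_tokens):
    """Single-head attention over the concatenation of both token lists.

    Hypothetical stand-in for a cross-modal block, not the paper's design:
    there are no learned weights, so queries, keys, and values are the
    raw tokens themselves.
    """
    tokens = motion_tokens + text_tokens
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    mixed_tokens = []
    for query in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Weighted average of all tokens, motion and text alike.
        mixed = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        mixed_tokens.append(mixed)

    # Split the result back into the two modality streams.
    split = len(motion_tokens)
    return mixed_tokens[:split], mixed_tokens[split:]


# Toy inputs: three motion tokens and two text tokens, each of width 2.
motion = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, 0.5], [0.0, 2.0]]
new_motion, new_text = joint_attention(motion, text)
```

Each output token is a convex combination of all input tokens, which is why the motion stream ends up carrying information from the text stream and vice versa. A real diffusion transformer would add learned projections and per-modality feed-forward layers around this step.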

Findings

01. Achieved state-of-the-art text-to-motion FID score of 0.106.

02. Demonstrated effective bidirectional motion-text generation.

03. Outperformed previous models in motion prediction and in-between tasks.

Abstract

Human motion generation has advanced markedly with the advent of diffusion models. Most recent studies have concentrated on generating motion sequences based on text prompts, commonly referred to as text-to-motion generation. However, the bidirectional generation of motion and text, enabling tasks such as motion-to-text alongside text-to-motion, has been largely unexplored. This capability is essential for aligning diverse modalities and supports unconditional generation. In this paper, we introduce PackDiT, the first diffusion-based generative model capable of performing various tasks simultaneously, including motion generation, motion prediction, text generation, text-to-motion, motion-to-text, and joint motion-text generation. Our core innovation leverages mutual blocks to integrate multiple diffusion transformers (DiTs) across different modalities seamlessly. We train PackDiT on the…

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · Hand Gesture Recognition Systems

Methods: Diffusion