TextIM: Part-aware Interactive Motion Synthesis from Text
Siyuan Fan, Bo Du, Xiantao Cai, Bo Peng, Longling Sun

TL;DR
TextIM is a new framework that synthesizes human interactive motions from text with precise part-level semantic alignment, improving realism and accuracy in complex interaction scenarios.
Contribution
It introduces a decoupled diffusion model and a part-aware spatial coherence module for detailed, semantically accurate motion synthesis from textual descriptions.
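As a rough illustration of what "decoupled" and "part-aware" could mean in practice, the toy sketch below separates the joints of the interacting body parts from the rest of the skeleton and then recombines two motions. Everything in it is an assumption made for illustration: the 12-joint layout, `BODY_PARTS`, `part_mask`, and `merge_motion` are hypothetical names, and a plain masked merge stands in for the paper's diffusion models and spatial coherence module, which are not reproduced here.

```python
# Hypothetical joint grouping; the index layout is an assumption, not from the paper.
BODY_PARTS = {
    "left_arm": [0, 1, 2],
    "right_arm": [3, 4, 5],
    "torso": [6, 7],
    "legs": [8, 9, 10, 11],
}


def part_mask(parts, num_joints=12):
    """Return a per-joint boolean mask that is True for joints in the named parts."""
    mask = [False] * num_joints
    for name in parts:
        for joint in BODY_PARTS[name]:
            mask[joint] = True
    return mask


def merge_motion(part_motion, body_motion, mask):
    """For each frame, take masked joints from part_motion and the rest from body_motion."""
    return [
        [p if m else b for p, b, m in zip(part_frame, body_frame, mask)]
        for part_frame, body_frame in zip(part_motion, body_motion)
    ]


# Toy data: 2 frames, 12 joints, one scalar per joint for brevity.
interacting = [[1.0] * 12 for _ in range(2)]  # stand-in for the interacting-part motion
whole_body = [[0.0] * 12 for _ in range(2)]   # stand-in for the remaining body motion

mask = part_mask(["right_arm"])
merged = merge_motion(interacting, whole_body, mask)
```

In this sketch the interacting part is named by hand; in the framework described here, a large language model identifies it from the text.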
Findings
Significantly improves motion realism and semantic accuracy.
Effectively models complex interactions with deformable objects.
Outperforms existing methods in diverse interaction scenarios.
Abstract
In this work, we propose TextIM, a novel framework for synthesizing TEXT-driven human Interactive Motions, with a focus on the precise alignment of part-level semantics. Existing methods often overlook the critical roles of interactive body parts and fail to adequately capture and align part-level semantics, resulting in inaccuracies and even erroneous movement outcomes. To address these issues, TextIM utilizes a decoupled conditional diffusion framework to enhance the detailed alignment between interactive movements and corresponding semantic intents from textual descriptions. Our approach leverages large language models, functioning as a human brain, to identify interacting human body parts and to comprehend interaction semantics to generate complicated and subtle interactive motion. Guided by the refined movements of the interacting parts, TextIM further extends these movements into…
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · 3D Shape Modeling and Analysis
Methods: Diffusion · Focus · ALIGN
