TL;DR
LGTM introduces a two-stage diffusion-based pipeline that leverages large language models and body-part encoders to generate semantically accurate, locally aligned human motions from text descriptions, improving coherence and precision.
Contribution
The paper presents LGTM, a Local-to-Global pipeline that combines LLM-based decomposition of text descriptions with a diffusion-based motion model to improve the accuracy of text-to-motion generation.
Findings
Significant improvement in local semantic alignment of generated motions.
Enhanced overall coherence of full-body human motion synthesis.
Effective decomposition of global descriptions into part-specific narratives.
Abstract
In this paper, we introduce LGTM, a novel Local-to-Global pipeline for Text-to-Motion generation. LGTM utilizes a diffusion-based architecture and aims to address the challenge of accurately translating textual descriptions into semantically coherent human motion in computer animation. Specifically, traditional methods often struggle with semantic discrepancies, particularly in aligning specific motions to the correct body parts. To address this issue, we propose a two-stage pipeline: it first employs large language models (LLMs) to decompose global motion descriptions into part-specific narratives, which are then processed by independent body-part motion encoders to ensure precise local semantic alignment. Finally, an attention-based full-body optimizer refines the motion generation results and ensures overall coherence. Our experiments demonstrate…
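The local-to-global flow described in the abstract (decompose the text per body part, encode each part independently, then fuse across parts with attention) can be illustrated with a toy sketch. This is not the authors' implementation: the body-part list, the keyword table standing in for the LLM, the character-sum encoder, and all function names are assumptions made purely for illustration, and the diffusion model itself is omitted.

```python
import math

# Hypothetical body-part partition; the summary does not state the paper's actual one.
BODY_PARTS = ["head", "arms", "torso", "legs"]

# Stand-in for the LLM decomposition step: a keyword lookup, NOT a real LLM call.
KEYWORDS = {
    "head": ["nod", "look"],
    "arms": ["wave", "raise", "clap"],
    "torso": ["bend", "lean"],
    "legs": ["walk", "jump", "kick"],
}


def decompose(description):
    """Map a global description to one short narrative per body part."""
    words = description.lower().split()
    narratives = {}
    for part in BODY_PARTS:
        hits = [w for w in words if any(w.startswith(k) for k in KEYWORDS[part])]
        narratives[part] = " ".join(hits) if hits else "idle"
    return narratives


def encode_part(narrative, dim=4):
    """Toy independent per-part encoder: deterministic character-sum features."""
    vec = [0.0] * dim
    for i, ch in enumerate(narrative):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(part_vecs):
    """Single-head scaled dot-product self-attention across body parts."""
    dim = len(part_vecs[0])
    fused = []
    for q in part_vecs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in part_vecs]
        weights = softmax(scores)
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, part_vecs)) for j in range(dim)]
        )
    return fused


def local_to_global(description):
    """Stage 1: local narratives and encodings. Stage 2: attention-based fusion."""
    narratives = decompose(description)
    local = [encode_part(narratives[p]) for p in BODY_PARTS]
    return narratives, attention_fuse(local)


narratives, fused = local_to_global("a person walks forward and waves")
print(narratives)
```

In this sketch each body part receives only its own narrative before fusion, which is the property the abstract attributes to the independent body-part encoders; the attention step then lets every part's feature depend on all the others, a simplified analogue of the full-body optimizer.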
