Exploring Motion-Language Alignment for Text-driven Motion Generation
Ruxi Gu, Zilei Wang, Wei Wang

TL;DR
This paper introduces MLA-Gen, a framework for text-driven human motion generation that improves alignment between textual descriptions and generated motions by addressing an attention sink on the start text token and combining global motion priors with fine-grained local conditioning.
Contribution
The paper proposes MLA-Gen, a novel framework that enhances motion-language alignment and addresses attention sink problems in text-to-motion generation.
Findings
MLA-Gen outperforms strong baselines in motion quality.
The SinkRatio metric effectively measures attention concentration.
Alignment-aware strategies improve semantic grounding in generated motions.
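The attention concentration mentioned in the findings can be illustrated with a simple proxy: the average share of each motion query's attention that lands on one text token (by default the start token). This is only a sketch under stated assumptions. It is not the paper's SinkRatio, whose definition is not given in this summary, and the function name, the row-per-query layout, and the toy matrices are all hypothetical.

```python
from typing import List


def start_token_attention_share(attn: List[List[float]], sink_index: int = 0) -> float:
    """Average share of attention mass placed on one text token.

    Hypothetical proxy, not the paper's SinkRatio.
    attn[i][j] is assumed to be the non-negative attention weight from
    motion query i to text token j. Rows are renormalised here, so they
    do not need to sum exactly to 1.
    """
    if not attn:
        raise ValueError("attention matrix is empty")
    shares = []
    for row in attn:
        total = sum(row)
        if total <= 0:
            raise ValueError("each row must have positive total weight")
        shares.append(row[sink_index] / total)
    return sum(shares) / len(shares)


# Toy matrices (made up for illustration): 3 motion queries, 4 text tokens.
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
sinky = [[0.85, 0.05, 0.05, 0.05] for _ in range(3)]

print(start_token_attention_share(uniform))
print(start_token_attention_share(sinky))
```

A value near 1/(number of text tokens) would suggest attention is spread evenly, while a value near 1 would suggest most attention is absorbed by the start token, which is the situation the summary describes as limiting the use of informative textual cues.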
Abstract
Text-driven human motion generation aims to synthesize realistic motion sequences that follow textual descriptions. Despite recent advances, accurately aligning motion dynamics with textual semantics remains a fundamental challenge. In this paper, we revisit text-to-motion generation from the perspective of motion-language alignment and propose MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning. This design enables the model to capture common motion patterns, while establishing detailed alignment between texts and motions. Furthermore, we identify a previously overlooked attention sink phenomenon in human motion generation, where attention disproportionately concentrates on the start text token, limiting the utilization of informative textual cues and leading to degraded semantic grounding. To analyze this issue, we introduce SinkRatio, a…
