CASIM: Composite Aware Semantic Injection for Text to Motion Generation
Che-Jui Chang, Qingze Tony Liu, Honglu Zhou, Vladimir Pavlovic, Mubbasir Kapadia

TL;DR
CASIM replaces fixed-length global text embeddings with a composite-aware semantic injection mechanism that aligns text tokens with motion tokens, improving motion quality, text-motion alignment, and controllability across both autoregressive and diffusion-based models.
Contribution
The paper presents CASIM, a model-agnostic, composite-aware semantic injection mechanism that significantly improves text-to-motion generation quality and controllability over existing fixed-length embedding methods.
Findings
CASIM improves motion quality and alignment scores on benchmarks.
CASIM enables more precise motion control from text prompts.
CASIM generalizes well to unseen text inputs.
Abstract
Recent advances in generative modeling and tokenization have driven significant progress in text-to-motion generation, leading to enhanced quality and realism in generated motions. However, effectively leveraging textual information for conditional motion generation remains an open challenge. We observe that current approaches, primarily relying on fixed-length text embeddings (e.g., CLIP) for global semantic injection, struggle to capture the composite nature of human motion, resulting in suboptimal motion quality and controllability. To address this limitation, we propose the Composite Aware Semantic Injection Mechanism (CASIM), comprising a composite-aware semantic encoder and a text-motion aligner that learns the dynamic correspondence between text and motion tokens. Notably, CASIM is model and representation-agnostic, readily integrating with both autoregressive and diffusion-based…
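The contrast the abstract draws between global semantic injection and token-level text-motion alignment can be illustrated with a small sketch. This is not the authors' implementation: the excerpt does not describe the encoder or aligner architecture, so the example below assumes plain scaled dot-product cross-attention with no learned projections, and the function names (`global_injection`, `token_level_injection`) and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def global_injection(motion_tokens, text_tokens):
    """Baseline (assumed): mean-pool the text into one fixed-length vector
    and add that same vector to every motion token."""
    dim = len(text_tokens[0])
    pooled = [sum(t[d] for t in text_tokens) / len(text_tokens) for d in range(dim)]
    return [[m[d] + pooled[d] for d in range(dim)] for m in motion_tokens]


def token_level_injection(motion_tokens, text_tokens):
    """Illustrative alternative: each motion token attends over the individual
    text tokens, so different motion tokens can draw on different words."""
    dim = len(text_tokens[0])
    scale = math.sqrt(dim)
    conditioned, attention = [], []
    for m in motion_tokens:
        weights = softmax([dot(m, t) / scale for t in text_tokens])
        context = [
            sum(w * t[d] for w, t in zip(weights, text_tokens)) for d in range(dim)
        ]
        conditioned.append([m[d] + context[d] for d in range(dim)])
        attention.append(weights)
    return conditioned, attention


# Toy data: two text tokens (say, "walk" and "wave") and two motion tokens.
text_tokens = [[4.0, 0.0], [0.0, 4.0]]
motion_tokens = [[1.0, 0.0], [0.0, 1.0]]

global_out = global_injection(motion_tokens, text_tokens)
local_out, attention = token_level_injection(motion_tokens, text_tokens)
```

With the pooled vector, both motion tokens receive an identical offset, whereas the attention weights let each motion token emphasize the text token it matches best. A real system would add learned query, key, and value projections and operate on encoder outputs rather than hand-written vectors.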
Taxonomy
Topics: Human Motion and Animation · Handwritten Text Recognition Techniques · Human Pose and Action Recognition
Methods: Attentive Walk-Aggregating Graph Neural Network
