FTMoMamba: Motion Generation with Frequency and Text State Space Models
Chengjian Li, Xiangbo Shu, Qiongjie Cui, Yazhou Yao, Jinhui Tang

TL;DR
FTMoMamba introduces a novel diffusion framework that leverages frequency and text state space models to improve human motion generation, capturing fine-grained motions and aligning text semantics with generated motions.
Contribution
The paper proposes FTMoMamba, a diffusion-based model with Frequency and Text State Space Models, to better capture motion details and semantic consistency in text-to-motion generation.
Findings
Achieves the lowest FID, 0.181, on the HumanML3D dataset.
Effectively decomposes motion into frequency components for detailed generation.
Aligns textual semantics with motion sequences for improved consistency.
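For context on the first finding, FID (Fréchet Inception Distance) is commonly defined as the Fréchet distance between two Gaussians fitted to feature embeddings of real and generated samples; lower is better. The formula below is the standard definition, not something stated in this summary. The symbols are assumptions introduced here: \mu_r, \Sigma_r are the mean and covariance of real-motion features and \mu_g, \Sigma_g those of generated-motion features. Which feature extractor the paper uses is not given here.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```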
Abstract
Diffusion models achieve impressive performance in human motion generation. However, current approaches typically ignore the significance of frequency-domain information in capturing fine-grained motions within the latent space (e.g., low frequencies correlate with static poses, and high frequencies align with fine-grained motions). Additionally, there is a semantic discrepancy between text and motion, leading to inconsistency between the generated motions and the text descriptions. In this work, we propose a novel diffusion-based FTMoMamba framework equipped with a Frequency State Space Model (FreqSSM) and a Text State Space Model (TextSSM). Specifically, to learn fine-grained representation, FreqSSM decomposes sequences into low-frequency and high-frequency components, guiding the generation of static pose (e.g., sits, lay) and fine-grained motions (e.g., transition, stumble),…
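The idea of splitting a motion sequence into low- and high-frequency components can be illustrated with a minimal sketch. This is not the paper's FreqSSM: it swaps in a plain FFT low-pass split along the time axis as a stand-in, and the function name `split_frequencies`, the `cutoff_ratio` parameter, and the toy signal are all hypothetical choices made for illustration.

```python
import numpy as np


def split_frequencies(motion, cutoff_ratio=0.25):
    """Split a (frames, features) sequence into low- and high-frequency parts.

    Illustrative stand-in only (FFT low-pass), not the paper's FreqSSM.
    cutoff_ratio is a hypothetical knob: the fraction of rFFT bins
    treated as "low frequency".
    """
    motion = np.asarray(motion, dtype=np.float64)
    n_frames = motion.shape[0]

    # Real FFT along the time axis, one spectrum per feature channel.
    spectrum = np.fft.rfft(motion, axis=0)
    cutoff = max(1, int(np.ceil(cutoff_ratio * spectrum.shape[0])))

    # Keep only the lowest bins, zero the rest, and transform back.
    low_spectrum = np.zeros_like(spectrum)
    low_spectrum[:cutoff] = spectrum[:cutoff]
    low = np.fft.irfft(low_spectrum, n=n_frames, axis=0)

    # The high-frequency part is the residual, so low + high == motion.
    high = motion - low
    return low, high


# Toy sequence: a slow sway plus a small fast jitter, and a constant channel.
t = np.linspace(0.0, 1.0, 64, endpoint=False)
slow = np.sin(2 * np.pi * 1 * t)
fast = 0.2 * np.sin(2 * np.pi * 20 * t)
motion = np.stack([slow + fast, np.full_like(t, 0.5)], axis=1)

low, high = split_frequencies(motion)
```

In this toy case the slow sway and the constant channel end up in `low`, while the fast jitter ends up in `high`, which mirrors the abstract's intuition that low frequencies track static poses and high frequencies track fine-grained motion.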
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: ALIGN
