ReMoMask: Retrieval-Augmented Masked Motion Generation
Zhengdao Li, Siheng Wang, Zeyu Zhang, Hao Tang

TL;DR
ReMoMask is a retrieval-augmented framework for text-to-motion generation. It combines a momentum-based text-motion retrieval model, biomechanically constrained spatio-temporal attention, and classifier-free guidance, and reports state-of-the-art results.
Contribution
A unified framework that combines a Bidirectional Momentum Text-Motion Model, Semantic Spatio-temporal Attention, and classifier-free guidance for retrieval-augmented motion synthesis.
Findings
Achieves 3.88% and 10.97% lower FID scores on benchmarks.
Demonstrates improved diversity and physical plausibility.
Outperforms previous methods in standard evaluations.
Abstract
Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free…
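The first innovation, decoupling the number of negative samples from the batch size via momentum queues, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a MoCo-style setup (an exponential-moving-average key encoder plus fixed-size FIFO queues of past embeddings) and a symmetric InfoNCE loss for the text-to-motion and motion-to-text directions. All names (`MomentumQueue`, `ema_update`, `info_nce`, `bidirectional_loss`, `run_demo`), the temperature, and the queue size are hypothetical, and random vectors stand in for encoder outputs.

```python
import math
import random
from collections import deque


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ema_update(key_params, query_params, m=0.99):
    """Momentum update of the key encoder: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class MomentumQueue:
    """FIFO store of past key embeddings; its size is independent of batch size."""

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)  # oldest entries are dropped first

    def enqueue(self, embeddings):
        self.items.extend(embeddings)

    def __len__(self):
        return len(self.items)


def info_nce(query, positive, negatives, temperature=0.07):
    """Negative log-probability of the positive among positive + negatives."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    mx = max(logits)  # subtract the max for numerical stability
    log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_z - logits[0]


def bidirectional_loss(text_q, motion_q, text_k, motion_k,
                       text_queue, motion_queue, temperature=0.07):
    """Average of text-to-motion and motion-to-text losses, each using queued negatives."""
    t2m = info_nce(text_q, motion_k, list(motion_queue.items), temperature)
    m2t = info_nce(motion_q, text_k, list(text_queue.items), temperature)
    return 0.5 * (t2m + m2t)


def run_demo(steps=20, batch_size=4, dim=8, queue_size=64, seed=0):
    """Toy loop: random unit vectors stand in for encoder outputs (assumption)."""
    rng = random.Random(seed)

    def rand_vec():
        return normalize([rng.gauss(0, 1) for _ in range(dim)])

    text_queue = MomentumQueue(queue_size)
    motion_queue = MomentumQueue(queue_size)
    loss = 0.0
    for _ in range(steps):
        text_k = [rand_vec() for _ in range(batch_size)]
        motion_k = [rand_vec() for _ in range(batch_size)]
        # For brevity the key embeddings are reused as queries; in training the
        # queries would come from the gradient-updated encoders instead.
        loss = sum(
            bidirectional_loss(t, m, t, m, text_queue, motion_queue)
            for t, m in zip(text_k, motion_k)
        ) / batch_size
        # Enqueue after computing the loss so a sample is never its own negative.
        text_queue.enqueue(text_k)
        motion_queue.enqueue(motion_k)
    return loss, len(motion_queue)
```

Calling `run_demo()` uses a batch of 4 pairs per step, yet each query is eventually contrasted against 64 queued negatives, which is the decoupling the abstract describes. In a real system, `ema_update` would be applied to the key encoders' parameters after every optimizer step.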
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
