ReMoMask: Retrieval-Augmented Masked Motion Generation

Zhengdao Li; Siheng Wang; Zeyu Zhang; Hao Tang

arXiv:2508.02605 · cs.CV · October 7, 2025

TL;DR

ReMoMask is a text-to-motion generation framework that combines retrieval augmentation, biomechanically constrained spatio-temporal attention, and classifier-free guidance with masked motion generation, and reports state-of-the-art results.

Contribution

It introduces a unified framework that combines a Bidirectional Momentum Text-Motion Model for cross-modal retrieval, Semantic Spatio-temporal Attention for part-level fusion, and classifier-free guidance for motion synthesis.
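As a rough illustration of the classifier-free guidance component, the sketch below shows the generic guidance rule applied to token logits: the guided logits are the unconditional logits plus a scaled difference between the conditional and unconditional ones. The function name `cfg_logits`, the `scale` parameter, and the toy values are hypothetical. The source does not spell out how ReMoMask conditions on retrieved motions, so this covers only the standard rule, not the paper's exact variant.

```python
def cfg_logits(cond_logits, uncond_logits, scale):
    """Generic classifier-free guidance on logits (illustrative, not the paper's code).

    guided = uncond + scale * (cond - uncond)
    scale = 0 gives the unconditional logits, scale = 1 the conditional ones,
    and scale > 1 pushes further in the direction suggested by the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


if __name__ == "__main__":
    cond = [2.0, 0.0]    # toy logits given the condition (e.g. text, possibly retrieved motion)
    uncond = [1.0, 1.0]  # toy logits with the condition dropped
    for s in (0.0, 1.0, 3.0):
        print(s, cfg_logits(cond, uncond, s))
```

Some implementations write the same rule as (1 + s) * cond - s * uncond, which is equivalent with scale = 1 + s.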

Findings

1. Achieves 3.88% and 10.97% lower FID scores on benchmarks.

2. Demonstrates improved diversity and physical plausibility.

3. Outperforms previous methods in standard evaluations.

Abstract

Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: Generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free…
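The momentum-queue idea in item 1 can be sketched in a few lines. The example below is a minimal pure-Python illustration, assuming a MoCo-style design: keys from earlier batches are kept in fixed-size queues and reused as negatives, so the number of negatives is set by the queue size rather than the batch size. All names (`KeyQueue`, `info_nce`, `momentum_update`, `bidirectional_loss`, `demo`), the temperature, and the random stand-in embeddings are hypothetical. The abstract does not give the actual encoders, loss, or hyperparameters.

```python
import math
import random
from collections import deque


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def momentum_update(online, momentum, m=0.99):
    """EMA rule a momentum (key) encoder would use to track the online encoder."""
    return [m * pm + (1.0 - m) * po for pm, po in zip(momentum, online)]


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query against one positive and the queued negatives."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negative_keys]
    mx = max(logits)  # subtract the max for numerical stability
    log_sum = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_sum - logits[0]


class KeyQueue:
    """Fixed-size FIFO of past keys; the oldest keys are dropped automatically."""

    def __init__(self, max_size):
        self.keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)

    def __len__(self):
        return len(self.keys)


def bidirectional_loss(text_q, motion_q, text_k, motion_k, motion_queue, text_queue):
    """Average of text-to-motion and motion-to-text InfoNCE terms."""
    t2m = info_nce(text_q, motion_k, list(motion_queue.keys))
    m2t = info_nce(motion_q, text_k, list(text_queue.keys))
    return 0.5 * (t2m + m2t)


def demo(batch_size=4, queue_size=64, steps=20, dim=8, seed=0):
    rng = random.Random(seed)
    motion_queue, text_queue = KeyQueue(queue_size), KeyQueue(queue_size)
    loss = 0.0
    for _ in range(steps):
        # Random unit vectors stand in for encoder outputs; in a real model the
        # keys would come from momentum encoders updated via momentum_update.
        texts = [normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(batch_size)]
        motions = [normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(batch_size)]
        losses = [bidirectional_loss(t, m, t, m, motion_queue, text_queue)
                  for t, m in zip(texts, motions)]
        loss = sum(losses) / batch_size
        motion_queue.enqueue(motions)  # current keys become future negatives
        text_queue.enqueue(texts)
    return len(motion_queue), loss


if __name__ == "__main__":
    n_negatives, last_loss = demo()
    print("negatives available:", n_negatives, "last batch loss:", last_loss)
```

With a batch of 4 and a queue of 64, each query is eventually contrasted against 64 negatives, which is the decoupling the abstract refers to.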

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition