MOST: Motion Diffusion Model for Rare Text via Temporal Clip Banzhaf Interaction

Yin Wang; Mu li; Zhiying Leng; Frederick W. B. Li; Xiaohui Liang

arXiv:2507.06590 · cs.CV · July 10, 2025

TL;DR

MOST introduces a novel motion diffusion model that leverages temporal clip Banzhaf interaction to improve human motion generation from rare text prompts, addressing semantic coherence and redundancy issues.

Contribution

It presents the first formulation of temporal clip Banzhaf interaction for precise text-motion matching and integrates it into a diffusion-based framework for enhanced motion generation.

Findings

1. Achieves state-of-the-art performance in text-to-motion retrieval.
2. Effectively generates semantically consistent human motions from rare prompts.
3. Outperforms previous methods in qualitative and quantitative evaluations.

Abstract

We introduce MOST, a novel motion diffusion model via temporal clip Banzhaf interaction, aimed at addressing the persistent challenge of generating human motion from rare language prompts. While previous approaches struggle with coarse-grained matching and overlook important semantic cues due to motion redundancy, our key insight lies in leveraging fine-grained clip relationships to mitigate these issues. MOST's retrieval stage presents the first formulation of its kind - temporal clip Banzhaf interaction - which precisely quantifies textual-motion coherence at the clip level. This facilitates direct, fine-grained text-to-motion clip matching and eliminates prevalent redundancy. In the generation stage, a motion prompt module effectively utilizes retrieved motion clips to produce semantically consistent movements. Extensive evaluations confirm that MOST achieves state-of-the-art…
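
The abstract names Banzhaf interaction but does not give a formula. As a rough illustration only, the sketch below computes the standard pairwise Banzhaf interaction index from cooperative game theory, I(i, j) = 2^-(n-2) · Σ over S ⊆ N \ {i, j} of [v(S ∪ {i, j}) − v(S ∪ {i}) − v(S ∪ {j}) + v(S)], for a toy game whose players are words and motion clips. The names `banzhaf_interaction`, `toy_value`, `SIM`, and `PLAYERS`, and all similarity scores, are assumptions made up for this example; the paper's temporal clip formulation may differ.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Hashable, Sequence

Player = Hashable
ValueFn = Callable[[FrozenSet[Player]], float]


def banzhaf_interaction(
    players: Sequence[Player], i: Player, j: Player, value: ValueFn
) -> float:
    """Standard pairwise Banzhaf interaction index of players i and j.

    Averages the joint marginal gain of adding i and j together,
    minus the gains of adding each alone, over every coalition S
    drawn from the remaining players.
    """
    others = [p for p in players if p != i and p != j]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
    # There are 2^(n-2) coalitions S, so divide by that count.
    return total / (2 ** len(others))


# Hypothetical word-to-clip similarity scores (invented for illustration,
# not taken from the paper).
SIM = {
    ("walk", "clip0"): 0.75,
    ("walk", "clip1"): 0.25,
    ("backflip", "clip0"): 0.0,
    ("backflip", "clip1"): 0.5,
}
PLAYERS = ["walk", "backflip", "clip0", "clip1"]


def toy_value(coalition: FrozenSet[Player]) -> float:
    """Toy coalition value: total similarity of word-clip pairs fully inside it."""
    return sum(
        score
        for (word, clip), score in SIM.items()
        if word in coalition and clip in coalition
    )


if __name__ == "__main__":
    for word, clip in SIM:
        score = banzhaf_interaction(PLAYERS, word, clip, toy_value)
        print(f"{word:>8} x {clip}: {score:.2f}")
```

Because this toy value function is a plain sum over word-clip pairs, each word-clip interaction reduces to that pair's similarity score, and the interaction between two words is zero. A learned value function would not be additive in this way, which is where the interaction index carries information beyond pairwise similarity. Exact enumeration visits 2^(n-2) coalitions, so it is only practical for a handful of players.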

Taxonomy

Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Human Pose and Action Recognition