MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks
Yiming Wu, Wei Ji, Kecheng Zheng, Zicheng Wang, Dong Xu

TL;DR
MoTe is a unified multi-modal diffusion model that effectively handles various motion and text generation tasks by learning joint, marginal, and conditional distributions, demonstrating superior results on benchmarks.
Contribution
The paper introduces MoTe, a novel model that unifies multiple motion-text tasks within a single framework using diffusion models and multi-modal encoders.
Findings
Superior performance on text-to-motion generation
Competitive results on motion captioning
Effective multi-task learning with a single model
Abstract
Recently, human motion analysis has improved greatly thanks to generative models such as denoising diffusion models and large language models. However, existing approaches mainly focus on generating motions from textual descriptions and overlook the reciprocal task. In this paper, we present MoTe, a unified multi-modal model that handles diverse tasks by learning the marginal, conditional, and joint distributions of motion and text simultaneously. MoTe handles paired text-motion generation, motion captioning, and text-driven motion generation by simply modifying the input context. Specifically, MoTe is composed of three components: Motion Encoder-Decoder (MED), Text Encoder-Decoder (TED), and Motion-Text Diffusion Model (MTDM). In particular, MED and TED are trained for extracting latent embeddings, and subsequently reconstructing…
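The abstract says that one model covers all tasks "by simply modifying the input context" but does not spell out the mechanism. The sketch below illustrates one plausible reading, assuming a per-modality timestep scheme of the kind used in some unified diffusion models: a modality given as a condition is held at timestep 0 (clean), a modality being generated follows the current step t, and a modality to be ignored is held at the final step T (pure noise). The function name `input_context`, the task labels, and the value of `T` are hypothetical and are not taken from the MoTe paper.

```python
from typing import Tuple

# Assumed total number of diffusion steps (hypothetical value, not from the paper).
T = 1000


def input_context(task: str, t: int) -> Tuple[int, int]:
    """Return (motion_timestep, text_timestep) for a task at denoising step t.

    Assumed convention (not confirmed by the source):
      0 -> the modality is clean and acts as the condition
      t -> the modality is currently being denoised
      T -> the modality is pure noise, i.e. effectively ignored
    """
    if not 0 <= t <= T:
        raise ValueError(f"t must be in [0, {T}], got {t}")

    contexts = {
        # Joint distribution: denoise motion and text together.
        "paired_generation": (t, t),
        # Conditional distribution: denoise motion given clean text.
        "text_to_motion": (t, 0),
        # Conditional distribution: denoise text given clean motion.
        "motion_captioning": (0, t),
        # Marginal distributions: denoise one modality, ignore the other.
        "motion_only": (t, T),
        "text_only": (T, t),
    }
    if task not in contexts:
        raise ValueError(f"unknown task: {task!r}")
    return contexts[task]


if __name__ == "__main__":
    for name in ("paired_generation", "text_to_motion", "motion_captioning"):
        print(name, input_context(name, 500))
```

Under this assumed convention, switching tasks changes only the pair of timesteps passed to the shared denoiser, which is one way a single set of weights could serve joint, conditional, and marginal sampling.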
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
Methods: Focus · Diffusion
