MoLingo: Motion-Language Alignment for Text-to-Motion Generation

Yannan He; Garvita Tiwari; Xiaohan Zhang; Pankaj Bora; Tolga Birdal; Jan Eric Lenssen; Gerard Pons-Moll

arXiv:2512.13840 · cs.CV · March 27, 2026

TL;DR

MoLingo is a text-to-motion model that runs diffusion in a continuous motion latent space. It combines a semantically aligned motion encoder with multi-token cross-attention text conditioning so that generated human motion follows the text description closely, and it reports state-of-the-art results.

Contribution

The paper proposes a semantic-aligned motion encoder and cross-attention conditioning to improve diffusion-based text-to-motion generation.
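
The difference between single-token and multi-token cross-attention conditioning can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the mean-pooling used to form the single token, the residual addition, the single attention head, and the random weights are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_token_condition(motion_latents, text_tokens):
    # Assumption: the "single token" is the mean of all text tokens,
    # added identically to every motion latent.
    pooled = text_tokens.mean(axis=0)           # (d,)
    return motion_latents + pooled              # broadcast over (T, d)

def cross_attention_condition(motion_latents, text_tokens, w_q, w_k, w_v):
    # Each motion latent queries every text token (one head, hypothetical weights).
    q = motion_latents @ w_q                    # (T, d)
    k = text_tokens @ w_k                       # (N, d)
    v = text_tokens @ w_v                       # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (T, N)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return motion_latents + weights @ v, weights

rng = np.random.default_rng(0)
T, N, d = 6, 4, 8                               # motion latents, text tokens, width
motion_latents = rng.normal(size=(T, d))
text_tokens = rng.normal(size=(N, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

single = single_token_condition(motion_latents, text_tokens)
multi, attn = cross_attention_condition(motion_latents, text_tokens, w_q, w_k, w_v)
print(single.shape, multi.shape, attn.shape)    # (6, 8) (6, 8) (6, 4)
```

In the single-token variant every motion latent receives the same conditioning vector, whereas in the cross-attention variant each latent gets its own weighting over the text tokens (the rows of `attn`), which is one plausible reason a multi-token scheme could follow a description more closely.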

Findings

1. Semantic alignment improves diffusion effectiveness

2. Cross-attention enhances motion realism and text alignment

3. Achieves state-of-the-art results on standard metrics

Abstract

We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and…

Taxonomy

Topics: Human Motion and Animation · Multimodal Machine Learning Applications · 3D Shape Modeling and Analysis