Temporal Consistency-Aware Text-to-Motion Generation

Hongsong Wang; Wenjing Yan; Qiuxia Lai; Xin Geng

arXiv:2602.18057 · cs.CV · March 11, 2026


Open Access

TL;DR

This paper introduces TCA-T2M, a text-to-motion framework that enforces cross-sequence temporal consistency (the temporal structure shared by different instances of the same action) to synthesize more realistic and coherent human motion from natural language descriptions.

Contribution

The paper proposes a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, a masked motion transformer for text-conditioned generation, and a kinematic constraint block aimed at physical plausibility.

Findings

01. Achieves state-of-the-art results on HumanML3D and KIT-ML benchmarks.

02. Significantly improves temporal coherence and semantic alignment.

03. Reduces discretization artifacts with a kinematic constraint block.

Abstract

Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignments and physically implausible motions. To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on HumanML3D and KIT-ML…
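The two-stage pattern the abstract refers to (discretize motion into tokens, then train a masked transformer over those tokens) can be illustrated with a toy sketch. This is not the TCaS-VQ-VAE or the paper's transformer: it shows only generic nearest-neighbour vector quantization and random token masking, and the codebook values, the `MASK` id, the masking ratio, and both function names are assumptions made for illustration.

```python
import random

MASK = -1  # hypothetical id standing in for a [MASK] token


def quantize(frames, codebook):
    """Map each continuous frame feature to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of tokens with MASK; return the masked list and positions."""
    n_masked = max(1, round(ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)


# Toy data: three 2-D codebook entries and four per-frame features (made up).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.6, 0.1]]

tokens = quantize(frames, codebook)  # stage 1: discrete motion tokens
masked, positions = mask_tokens(tokens, 0.5, random.Random(0))  # stage 2 input

print(tokens)            # [0, 1, 2, 1]
print(masked, positions)
```

In a real masked-modeling setup, a text-conditioned transformer would be trained to predict the original tokens at the masked positions; that model is omitted here.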


Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis