FlowCoMotion: Text-to-Motion Generation via Token-Latent Flow Modeling

Dawei Guan; Di Yang; Chengjie Jin; Jiangtao Wang

arXiv:2604.11083 · cs.CV · April 21, 2026

TL;DR

FlowCoMotion introduces a unified token-latent flow modeling framework for text-to-motion generation. It combines continuous and discrete motion representations so that the model captures both semantic content and fine-grained motion details.

Contribution

The paper proposes a token-latent coupling approach that unifies continuous and discrete motion representations for text-to-motion synthesis.

Findings

01 Achieves competitive results on HumanML3D and SnapMoGen benchmarks.

02 Effectively captures both semantic and fine-grained motion details.

03 Outperforms existing methods in text-to-motion generation tasks.

Abstract

Text-to-motion generation is driven by learning motion representations for semantic alignment with language. Existing methods rely on either continuous or discrete motion representations. However, continuous representations entangle semantics with dynamics, while discrete representations lose fine-grained motion details. In this context, we propose FlowCoMotion, a novel motion generation framework that unifies both treatments from a modeling perspective. Specifically, FlowCoMotion employs token-latent coupling to capture both semantic content and high-fidelity motion details. In the latent branch, we apply multi-view distillation to regularize the continuous latent space, while in the token branch we use discrete temporal resolution quantization to extract high-level semantic cues. The motion latent is then obtained by combining the representations from the two branches through a…
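The trade-off the abstract describes, where discrete representations lose fine-grained detail, can be illustrated with a toy sketch. This is not FlowCoMotion's quantizer: the 1-D "motion" signal, the four-entry codebook, and the function names `quantize` and `dequantize` are all hypothetical, chosen only to show that nearest-neighbor tokenization leaves a residual that a continuous representation would have to carry.

```python
# Toy illustration only; the values and names are assumptions, not from the paper.

def quantize(frames, codebook):
    """Map each frame value to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - x))
        for x in frames
    ]

def dequantize(tokens, codebook):
    """Reconstruct frame values from token indices."""
    return [codebook[t] for t in tokens]

# Hypothetical codebook and per-frame scalar "pose" values.
codebook = [0.0, 0.5, 1.0, 1.5]
frames = [0.02, 0.48, 0.61, 1.07, 1.44]

tokens = quantize(frames, codebook)    # coarse, discrete description
recon = dequantize(tokens, codebook)   # what the tokens alone can recover

# The residual is the fine-grained detail that the tokens discard.
residual = [x - r for x, r in zip(frames, recon)]

print("tokens:  ", tokens)
print("residual:", [round(r, 2) for r in residual])
```

In this sketch the tokens give a compact, coarse summary of the sequence, while the nonzero residual shows what is lost when only the discrete codes are kept.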
