Bilingual Text-to-Motion Generation: A New Benchmark and Baselines
Wanjiang Weng, Xiaofeng Tan, Xiangbo Shu, Guo-Sen Xie, Pan Zhou, Hongsong Wang

TL;DR
This paper introduces BiHumanML3D, a bilingual text-to-motion benchmark, and proposes BiMD with Cross-Lingual Alignment to improve cross-lingual motion generation, demonstrating significant performance gains over existing models.
Contribution
The paper presents the first bilingual text-to-motion dataset and a novel baseline with explicit semantic alignment for cross-lingual motion synthesis.
Findings
BiMD with CLA achieves a lower FID than baselines (0.045 vs. 0.169).
BiMD with CLA attains a higher R@3 than baselines (82.8% vs. 80.8%).
The approach enables effective zero-shot code-switching motion generation.
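For readers unfamiliar with the R@3 metric quoted above, the following is a minimal sketch of top-k retrieval precision between text and motion embeddings. It assumes that row i of each array forms a matched pair and that ranking uses Euclidean distance; the function name, the batch size of 32, and the synthetic embeddings are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def r_precision_at_k(text_emb, motion_emb, k=3):
    """Fraction of texts whose paired motion (same row index) ranks in the top k."""
    # Pairwise squared Euclidean distances, shape (n_texts, n_motions).
    diff = text_emb[:, None, :] - motion_emb[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # Indices of the k nearest motions for each text.
    topk = np.argsort(dist, axis=1)[:, :k]
    # A hit means the matched motion (index i for text i) is among them.
    targets = np.arange(text_emb.shape[0])[:, None]
    hits = np.any(topk == targets, axis=1)
    return float(np.mean(hits))

# Synthetic stand-ins for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
motion = rng.normal(size=(32, 16))
text = motion + 0.01 * rng.normal(size=(32, 16))  # texts lie close to their motions
score = r_precision_at_k(text, motion, k=3)
```

A higher score means the generated or paired motions are easier to retrieve from their text descriptions, which is why R@3 is read as a text-motion consistency measure.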
Abstract
Text-to-motion generation holds significant potential for cross-linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross-lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed via LLM-assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high-quality motion generation from bilingual inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and an R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual…
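The abstract does not specify how CLA aligns representations across languages. As one possible illustration only, the sketch below uses a symmetric contrastive (InfoNCE-style) loss that pulls together embeddings of the same caption in two languages and pushes apart mismatched pairs. The loss form, the temperature value, and all names here are assumptions, not the authors' method.

```python
import numpy as np

def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    return x - np.log(np.sum(np.exp(x), axis=axis, keepdims=True))

def alignment_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric contrastive loss; row i of emb_a and emb_b encode the same caption."""
    # L2-normalize so the dot product is cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    idx = np.arange(logits.shape[0])
    # Language A -> B direction: softmax over each row.
    loss_ab = -np.mean(_log_softmax(logits, axis=1)[idx, idx])
    # Language B -> A direction: softmax over each column.
    loss_ba = -np.mean(_log_softmax(logits, axis=0)[idx, idx])
    return 0.5 * (loss_ab + loss_ba)

# Synthetic stand-ins for text-encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(8, 32))
aligned = alignment_loss(emb_a, emb_a + 0.01 * rng.normal(size=(8, 32)))
misaligned = alignment_loss(emb_a, rng.normal(size=(8, 32)))
```

Under this assumed objective, well-aligned caption pairs give a loss near zero while unrelated pairs give a much larger one, so minimizing it would encourage a shared conditioning space for both languages.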
Taxonomy
Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
