Bilingual Text-to-Motion Generation: A New Benchmark and Baselines

Wanjiang Weng; Xiaofeng Tan; Xiangbo Shu; Guo-Sen Xie; Pan Zhou; Hongsong Wang

arXiv:2603.25178 · cs.CV · March 27, 2026

Open Access

TL;DR

This paper introduces BiHumanML3D, a bilingual text-to-motion benchmark, and proposes BiMD with Cross-Lingual Alignment (CLA) to improve cross-lingual motion generation, reporting an FID of 0.045 vs. 0.169 and R@3 of 82.8% vs. 80.8% against baselines.

Contribution

The paper presents the first bilingual text-to-motion dataset and a novel baseline with explicit semantic alignment for cross-lingual motion synthesis.

Findings

01 BiMD with CLA achieves a lower FID (0.045) than the baselines (0.169).

02 BiMD with CLA attains a higher R@3 (82.8%) than the baselines (80.8%).

03 The approach enables effective zero-shot code-switching motion generation.
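
For readers unfamiliar with the R@3 figure above, the following is a minimal sketch of a top-k text-to-motion retrieval score. It assumes that texts and motions have already been embedded into a shared space by some evaluator, that row i of each list forms a matched pair, and that Euclidean distance is used for ranking. The function name and these choices are illustrative assumptions, not the paper's evaluation code.

```python
import math


def top_k_retrieval_accuracy(text_emb, motion_emb, k=3):
    """Fraction of texts whose paired motion (same index) is among the k nearest motions.

    Hypothetical helper: the pairing-by-index convention and the Euclidean
    metric are assumptions made for this sketch.
    """
    if len(text_emb) != len(motion_emb):
        raise ValueError("text and motion embeddings must be paired one-to-one")
    hits = 0
    for i, t in enumerate(text_emb):
        # Distance from text i to every motion in the candidate pool.
        dists = [math.dist(t, m) for m in motion_emb]
        # Indices of the k closest motions (ties go to the lower index).
        nearest = sorted(range(len(dists)), key=lambda j: dists[j])[:k]
        hits += i in nearest
    return hits / len(text_emb)


if __name__ == "__main__":
    texts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    motions = [[0.0, 0.1], [1.0, 0.1], [0.0, 1.1], [0.2, 0.2]]
    print(top_k_retrieval_accuracy(texts, motions, k=1))
    print(top_k_retrieval_accuracy(texts, motions, k=3))
```

With k=3 this corresponds to the R@3 notion: a higher value means the generated or paired motion is more often retrieved among the three closest candidates for its text.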

Abstract

Text-to-motion generation holds significant potential for cross-linguistic applications, yet it is hindered by the lack of bilingual datasets and the poor cross-lingual semantic understanding of existing language models. To address these gaps, we introduce BiHumanML3D, the first bilingual text-to-motion benchmark, constructed via LLM-assisted annotation and rigorous manual correction. Furthermore, we propose a simple yet effective baseline, Bilingual Motion Diffusion (BiMD), featuring Cross-Lingual Alignment (CLA). CLA explicitly aligns semantic representations across languages, creating a robust conditional space that enables high-quality motion generation from bilingual inputs, including zero-shot code-switching scenarios. Extensive experiments demonstrate that BiMD with CLA achieves an FID of 0.045 vs. 0.169 and R@3 of 82.8% vs. 80.8%, significantly outperforming monolingual…
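
The abstract does not spell out how CLA aligns representations across languages. As one possible reading, the sketch below uses a symmetric contrastive (InfoNCE-style) objective that pulls together the embeddings of the same caption written in two languages and pushes apart mismatched captions. The loss form, the temperature value, and all function names here are assumptions for illustration and may differ from the paper's actual objective.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def alignment_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric contrastive loss between paired caption embeddings.

    emb_a[i] and emb_b[i] are assumed to encode the same caption in
    language A and language B (a hypothetical setup for this sketch).
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must be paired one-to-one")
    a = [_normalize(v) for v in emb_a]
    b = [_normalize(v) for v in emb_b]
    # Cosine similarity of every A caption against every B caption.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the A-to-B and B-to-A retrieval losses.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    lang_a = [[1.0, 0.0], [0.0, 1.0]]
    aligned_b = [[1.0, 0.0], [0.0, 1.0]]
    swapped_b = [[0.0, 1.0], [1.0, 0.0]]
    print(alignment_loss(lang_a, aligned_b) < alignment_loss(lang_a, swapped_b))
```

Under this assumed objective, a text encoder trained to minimize the loss would map a caption and its translation to nearby points, which is one way to obtain a shared conditional space for the motion generator.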

Taxonomy

Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis