Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony

Chao Xu; Mingze Sun; Zhi-Qi Cheng; Fei Wang; Yang Liu; Baigui Sun; Ruqi Huang; Alexander Hauptmann

arXiv:2408.09397 · cs.CV · September 22, 2025

Open Access

TL;DR

This paper introduces Combo, a framework for co-speech 3D human motion generation that effectively handles multiple inputs and outputs, enabling customizable and harmonious facial and body movements with efficient adaptation.

Contribution

The paper presents a novel transformer-based design and a parameter-efficient fine-tuning method for holistic 3D human motion generation and adaptation.
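
This summary does not describe the paper's own adapter design, so the following is only a generic illustration of what parameter-efficient fine-tuning means: a pre-trained weight matrix stays frozen while a small low-rank update is trained on top of it. The low-rank technique is swapped in here for illustration and is not claimed to be the authors' method; the names `LowRankAdapter`, `matvec`, `down`, and `up` are hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LowRankAdapter:
    """Frozen linear layer plus a small trainable low-rank update.

    Hypothetical sketch of generic parameter-efficient fine-tuning,
    not the adapter proposed in the paper.
    """

    def __init__(self, frozen_weight, rank, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(frozen_weight), len(frozen_weight[0])
        self.frozen_weight = frozen_weight  # pre-trained, never updated
        # Down-projection gets a small random init; up-projection starts at
        # zero, so the adapted layer initially matches the frozen layer.
        self.down = [[rng.gauss(0.0, 0.01) for _ in range(d_in)]
                     for _ in range(rank)]
        self.up = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.frozen_weight, x)
        delta = matvec(self.up, matvec(self.down, x))
        return [b + d for b, d in zip(base, delta)]

    def trainable_parameter_count(self):
        return (len(self.down) * len(self.down[0])
                + len(self.up) * len(self.up[0]))

    def frozen_parameter_count(self):
        return len(self.frozen_weight) * len(self.frozen_weight[0])
```

For a frozen weight of shape d_out × d_in and rank r, only r·(d_in + d_out) values are trained, which is far fewer than d_out·d_in when r is small relative to the layer size. Adapting to a new identity or emotion would then mean updating only `down` and `up`.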

Findings

01 High-quality motion generation demonstrated on the BEAT2 and SHOW datasets.

02 Efficient transfer of identity and emotion in generated motions.

03 Effective coordination of facial expressions and body movements.

Abstract

In this paper, we propose a novel framework, Combo, for harmonious co-speech holistic 3D human motion generation and efficient customizable adaptation. In particular, we identify one fundamental challenge as the multiple-input-multiple-output (MIMO) nature of the generative model of interest. More concretely, on the input end, the model typically consumes both speech signals and character guidance (e.g., identity and emotion), which not only poses a challenge to learning capacity but also hinders further adaptation to varying guidance; on the output end, holistic human motions mainly consist of facial expressions and body movements, which are inherently correlated but non-trivial to coordinate in current data-driven generation processes. In response to the above challenge, we propose tailored designs for both ends. For the former, we propose to pre-train on data regarding a fixed identity…

Taxonomy

Topics: Human Motion and Animation · Music Technology and Sound Studies · Human Pose and Action Recognition