Semantic Co-Speech Gesture Synthesis and Real-Time Control for Humanoid Robots
Gang Zhang

TL;DR
This paper introduces a system that generates semantically meaningful co-speech gestures and controls a humanoid robot in real time, improving the robot's natural non-verbal communication.
Contribution
It presents a novel end-to-end framework combining gesture synthesis from speech with real-time robot control, integrating large language models, Motion-GPT, and a robust retargeting method.
Findings
Gestures are semantically appropriate and expressive.
The robot accurately executes complex, synchronized motions.
The system operates in real time with high fidelity.
Abstract
We present an innovative end-to-end framework for synthesizing semantically meaningful co-speech gestures and deploying them in real time on a humanoid robot. This system addresses the challenge of creating natural, expressive non-verbal communication for robots by integrating advanced gesture generation techniques with robust physical control. Our core innovation lies in the integration of a semantics-aware gesture synthesis module, which derives expressive reference motions from speech input by leveraging a generative retrieval mechanism based on large language models (LLMs) and an autoregressive Motion-GPT model. This is coupled with a high-fidelity imitation learning control policy, the MotionTracker, which enables the Unitree G1 humanoid robot to execute these complex motions dynamically while maintaining balance. To ensure feasibility, we employ a robust General Motion…
Taxonomy
Topics: Social Robot Interaction and HRI · Human Motion and Animation · Robot Manipulation and Learning
