ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control
Haozhe Jia, Jianfei Song, Yuan Zhang, Honglei Jin, Youcheng Fan, Wenshuo Chen, Wei Zhang, and Yutao Yue

TL;DR
ECHO is an edge-cloud framework for natural-language-driven humanoid robot control: a cloud-hosted diffusion model synthesizes motion sequences from text, and an edge-deployed reinforcement-learning tracker executes them safely on the robot, as demonstrated in real-world experiments.
Contribution
The paper introduces a new integrated system combining cloud-based motion generation with edge-based execution for humanoid robots, improving flexibility and robustness.
Findings
High-quality motion generation (FID 0.029) with low latency (approximately one second per generated sequence)
Successful real-world deployment on a humanoid robot without fine-tuning
Effective fall recovery and safety mechanisms implemented
Abstract
We present ECHO, an edge-cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud…
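To make the 38-dimensional per-frame representation concrete, the following is a minimal sketch of how one frame might be unpacked on the edge side. The abstract does not give the field order or the joint count, so several things here are assumptions: the layout (joint angles first, then planar velocity, height, and orientation), the split of 29 + 2 + 1 + 6 = 38 (inferred by taking planar velocity as 2D and height as a scalar), the convention that the six orientation values are the first two columns of the rotation matrix, and all names (`MotionFrame`, `unpack_frame`, `rot6d_to_matrix`), which are hypothetical. The 6D-to-matrix step uses standard Gram-Schmidt orthonormalization.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

# Assumed layout, not confirmed by the source:
# 29 joint angles + 2 planar velocity + 1 root height + 6D orientation.
NUM_JOINTS = 29
FRAME_DIM = NUM_JOINTS + 2 + 1 + 6  # 38


@dataclass
class MotionFrame:
    joint_angles: List[float]          # one angle per actuated joint
    root_planar_velocity: List[float]  # two planar velocity components
    root_height: float                 # root height as a scalar
    root_rot6d: List[float]            # continuous 6D orientation


def unpack_frame(frame: Sequence[float]) -> MotionFrame:
    """Split a flat 38-value frame into named fields (hypothetical order)."""
    if len(frame) != FRAME_DIM:
        raise ValueError(f"expected {FRAME_DIM} values, got {len(frame)}")
    j = NUM_JOINTS
    return MotionFrame(
        joint_angles=list(frame[:j]),
        root_planar_velocity=list(frame[j:j + 2]),
        root_height=float(frame[j + 2]),
        root_rot6d=list(frame[j + 3:j + 9]),
    )


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-8:
        raise ValueError("cannot normalize a near-zero vector")
    return [x / norm for x in v]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _cross(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rot6d_to_matrix(r6: Sequence[float]) -> List[List[float]]:
    """Recover a 3x3 rotation matrix from a 6D orientation via Gram-Schmidt.

    Assumes r6 holds the first two matrix columns, one after the other.
    """
    a1, a2 = r6[:3], r6[3:6]
    b1 = _normalize(a1)
    proj = _dot(b1, a2)
    # Remove the component of a2 along b1, then normalize.
    b2 = _normalize([y - proj * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)
    # b1, b2, b3 are columns; return the matrix in row-major form.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]
```

A tracker could call `unpack_frame` on each generated frame and pass `joint_angles` to its PD targets, while `rot6d_to_matrix` recovers a valid rotation even if the generated six values are not exactly orthonormal.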
Taxonomy
Topics: Human Motion and Animation · Robot Manipulation and Learning · Human Pose and Action Recognition
