ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control

Haozhe Jia, Jianfei Song, Yuan Zhang, Honglei Jin, Youcheng Fan, Wenshuo Chen, Wei Zhang, and Yutao Yue

arXiv:2603.16188·cs.CV·March 18, 2026

Open Access

TL;DR

ECHO is an edge-cloud framework for natural-language-driven humanoid robot control: a cloud-side generator synthesizes motion sequences from text, and an on-robot tracker executes them, as demonstrated in real-world experiments.

Contribution

The paper introduces a new integrated system combining cloud-based motion generation with edge-based execution for humanoid robots, improving flexibility and robustness.

Findings

01. High-quality motion generation (FID 0.029) with low latency (generation in approximately one second)
02. Successful real-world deployment on a humanoid robot without fine-tuning
03. Effective fall-recovery and safety mechanisms

Abstract

We present ECHO, an edge-cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud…
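To make the 38-dimensional frame concrete, here is a minimal Python sketch of how one frame might be unpacked and how a continuous 6D orientation can be turned into a rotation matrix. The abstract names the components but not their order or sizes, so the layout below is an assumption: 29 joint angles (inferred as 38 - 2 - 1 - 6), then a 2D planar velocity, then a scalar height, then the 6D orientation. The function names `split_frame` and `rot6d_to_matrix` are hypothetical, and the Gram-Schmidt construction is the commonly used decoding for 6D rotation representations, not necessarily the convention the authors follow.

```python
import math

# Assumed layout of one frame; the abstract does not give order or sizes.
FRAME_DIM = 38
N_JOINTS = 29  # assumption: 38 - 2 (planar velocity) - 1 (height) - 6 (orientation)


def split_frame(frame):
    """Split one 38-D frame into named parts under the assumed layout."""
    if len(frame) != FRAME_DIM:
        raise ValueError(f"expected {FRAME_DIM} values, got {len(frame)}")
    return {
        "joint_angles": list(frame[:N_JOINTS]),
        "root_vel_xy": list(frame[N_JOINTS:N_JOINTS + 2]),
        "root_height": frame[N_JOINTS + 2],
        "root_rot6d": list(frame[N_JOINTS + 3:]),
    }


def _normalize(v):
    """Return v scaled to unit length; reject near-zero vectors."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-8:
        raise ValueError("cannot normalize a near-zero vector")
    return [x / norm for x in v]


def rot6d_to_matrix(rot6d):
    """Decode a 6D rotation into a 3x3 matrix (list of rows) via Gram-Schmidt.

    The six numbers are read as two 3-vectors a1, a2. They are
    orthonormalized into b1, b2, and b3 = b1 x b2 completes the basis.
    Column j of the returned matrix is b_j (an assumed convention).
    """
    a1, a2 = rot6d[:3], rot6d[3:6]
    b1 = _normalize(a1)
    proj = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - proj * x for x, y in zip(b1, a2)])
    b3 = [
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    return [[b1[i], b2[i], b3[i]] for i in range(3)]
```

Because the six orientation values need not be orthonormal to decode to a valid rotation, a generator can regress them directly without a constraint on its outputs, which is one common motivation for this representation.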

Taxonomy

Topics: Human Motion and Animation · Robot Manipulation and Learning · Human Pose and Action Recognition