UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots
Nan Jiang, Zimo He, Wanhe Yu, Lexi Pang, Yunhao Li, Hongjie Li, Jieming Cui, Yuhan Li, Yizhou Wang, Yixin Zhu, Siyuan Huang

TL;DR
UniAct is a two-stage framework that enables humanoid robots to follow diverse multimodal instructions in real time, improving success rates and generalization for versatile, human-like interaction.
Contribution
It introduces a unified approach combining a fine-tuned multimodal large language model with a causal streaming pipeline for real-time humanoid control.
Findings
19% improvement in zero-shot tracking success rate
Robust generalization across diverse real-world scenarios
Sub-500 ms latency in multimodal instruction execution
Abstract
A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions -- such as language, music, and trajectories -- into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking…
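To make the "shared discrete codebook via FSQ" idea concrete, here is a minimal sketch of finite scalar quantization in plain Python. It is a simplified illustration, not the authors' implementation: the level counts in `LEVELS`, the tanh-then-round bounding scheme, and all function names are assumptions chosen for the example, since the abstract does not give UniAct's actual configuration. Each latent dimension is squashed to a bounded range and rounded to one of a few levels, and the per-dimension digits are packed into a single token id that a language model could treat as a vocabulary entry.

```python
import math

# Hypothetical per-dimension level counts (an assumption for illustration;
# the paper's actual FSQ configuration is not stated in the abstract).
LEVELS = [8, 5, 5, 5]


def fsq_quantize(z, levels=LEVELS):
    """Map a real-valued latent vector to one integer digit per dimension."""
    digits = []
    for value, n_levels in zip(z, levels):
        bounded = math.tanh(value)                       # squash to (-1, 1)
        scaled = (bounded + 1.0) / 2.0 * (n_levels - 1)  # rescale to [0, n_levels - 1]
        digits.append(int(round(scaled)))                # snap to the nearest level
    return digits


def digits_to_index(digits, levels=LEVELS):
    """Pack per-dimension digits into a single token id (mixed-radix encoding)."""
    index = 0
    for digit, n_levels in zip(digits, levels):
        index = index * n_levels + digit
    return index


def index_to_digits(index, levels=LEVELS):
    """Inverse of digits_to_index: recover the per-dimension digits."""
    digits = []
    for n_levels in reversed(levels):
        digits.append(index % n_levels)
        index //= n_levels
    return digits[::-1]


# The implicit codebook size is the product of the level counts.
CODEBOOK_SIZE = math.prod(LEVELS)  # 8 * 5 * 5 * 5 = 1000

if __name__ == "__main__":
    latent = [10.0, -10.0, 0.0, 10.0]
    digits = fsq_quantize(latent)
    token = digits_to_index(digits)
    print(digits, token, index_to_digits(token))
```

Because the codebook is implied by the rounding grid rather than learned, every latent vector maps to a valid token. That is one plausible reading of how a shared token space could serve several input modalities, though the abstract does not spell out the mechanism.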
Taxonomy
Topics: Human Motion and Animation · Social Robot Interaction and HRI · Robot Manipulation and Learning
