UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots

Nan Jiang; Zimo He; Wanhe Yu; Lexi Pang; Yunhao Li; Hongjie Li; Jieming Cui; Yuhan Li; Yizhou Wang; Yixin Zhu; Siyuan Huang

arXiv:2512.24321·cs.CV·January 1, 2026

TL;DR

UniAct is a two-stage framework that enables humanoid robots to follow diverse multimodal instructions in real time, improving success rates and generalization for versatile, human-like interaction.

Contribution

It introduces a unified approach combining a fine-tuned multimodal large language model with a causal streaming pipeline for real-time humanoid control.

Findings

01. 19% improvement in zero-shot tracking success rate

02. Robust generalization across diverse real-world scenarios

03. Sub-500 ms latency in multimodal instruction execution

Abstract

A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions -- such as language, music, and trajectories -- into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking…
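The abstract attributes cross-modal alignment to a shared discrete codebook built with FSQ (finite scalar quantization). As a rough illustration of that general idea only, the sketch below bounds each latent dimension, rounds it to one of a few integer levels, and maps the resulting code to a single token index. The level counts in `LEVELS`, the function names, and the three-dimensional latent are assumptions made for this example; the abstract does not state UniAct's actual configuration, and the straight-through gradient used when training such a quantizer is omitted.

```python
import numpy as np

# Hypothetical number of quantization levels per latent dimension.
# The implied codebook size is their product: 5 * 5 * 4 = 100 tokens.
LEVELS = np.array([5, 5, 4])


def fsq_quantize(z, levels=LEVELS, eps=1e-3):
    """Bound each latent dimension with tanh, then round to an integer level."""
    half_l = (levels - 1) * (1 - eps) / 2
    # Even level counts need a half-step offset so rounding yields exactly L values.
    offset = np.where(levels % 2 == 1, 0.0, 0.5)
    shift = np.tan(offset / half_l)
    bounded = np.tanh(z + shift) * half_l - offset
    return np.round(bounded).astype(np.int64)


def _basis(levels):
    """Mixed-radix place values: [1, L0, L0*L1, ...]."""
    return np.concatenate(([1], np.cumprod(levels[:-1])))


def codes_to_index(codes, levels=LEVELS):
    """Map per-dimension integer codes to one token index in [0, prod(levels))."""
    digits = codes + levels // 2  # shift each code into [0, levels[i])
    return (digits * _basis(levels)).sum(axis=-1)


def index_to_codes(index, levels=LEVELS):
    """Invert codes_to_index."""
    digits = (np.asarray(index)[..., None] // _basis(levels)) % levels
    return digits - levels // 2


if __name__ == "__main__":
    # Stand-in for encoder outputs from any modality (values are arbitrary).
    z = np.array([[10.0, -10.0, 10.0], [0.0, 0.0, 0.0]])
    codes = fsq_quantize(z)
    tokens = codes_to_index(codes)
    print(codes, tokens)
```

Because the codebook is an implicit grid rather than a learned table, any encoder that emits latents of the same dimensionality lands in the same token vocabulary, which is one plausible reading of how a single quantizer could serve several input modalities.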

Taxonomy

Topics: Human Motion and Animation · Social Robot Interaction and HRI · Robot Manipulation and Learning