Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

Zhirui Liu; Kaiyang Ji; Ke Yang; Yahao Fan; Jingyi Yu; Ye Shi; Jingya Wang

arXiv:2511.22963 · cs.RO · May 12, 2026

TL;DR

Humanoid-LLA is a large language action model that translates natural language into stable, diverse, and physically plausible humanoid robot motions, advancing human-robot interaction.

Contribution

Humanoid-LLA introduces a unified human-humanoid motion vocabulary and a two-stage fine-tuning framework: supervised motion Chain-of-Thought learning, followed by reinforcement learning with physical feedback.

Findings

1. Achieves superior generalization to new language commands.
2. Generates diverse and physically stable motions.
3. Performs well in both simulation and real-world tests.

Abstract

Enabling humanoid robots to follow free-form natural language commands is a critical step toward seamless human-robot interaction and general-purpose embodied AI. However, existing methods remain limited: they are often constrained to simple instructions or forced to sacrifice motion diversity for physical plausibility. To address this gap, we present Humanoid-LLA, a Large Language Action model that translates unconstrained natural language directly into executable whole-body motions for humanoid robots. Our approach tackles two core challenges: the scarcity of paired language-humanoid motion data and physical instability. First, we bridge high-level language semantics with physically grounded control by learning a unified human-humanoid motion vocabulary. Second, we introduce a novel two-stage fine-tuning framework that begins with supervised motion Chain-of-Thought learning, followed by reinforcement…
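The summary does not say how the unified motion vocabulary is constructed. One common way to give a language model a motion vocabulary is vector quantization: each motion frame is mapped to its nearest entry in a shared codebook, and each codebook index becomes a special token appended to the text vocabulary. The minimal sketch below assumes that approach purely for illustration. The fixed `CODEBOOK`, the 3-D feature frames, the `<motion_k>` token names, and the helper functions are all hypothetical and are not taken from the paper.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical shared codebook of motion-feature prototypes. A real system
# would learn these (e.g. with a VQ-VAE); here they are fixed for clarity.
CODEBOOK = [
    (0.0, 0.0, 0.0),  # prototype 0
    (1.0, 0.0, 0.0),  # prototype 1
    (0.0, 1.0, 0.0),  # prototype 2
    (0.0, 0.0, 1.0),  # prototype 3
]
PREFIX = "<motion_"


def quantize(frame):
    """Return the index of the codebook entry nearest to a feature frame."""
    return min(range(len(CODEBOOK)), key=lambda i: dist(frame, CODEBOOK[i]))


def to_motion_tokens(frames):
    """Map a sequence of feature frames (human or humanoid) to token strings."""
    return [f"{PREFIX}{quantize(f)}>" for f in frames]


def from_motion_tokens(tokens):
    """Decode token strings back to their codebook prototypes."""
    return [CODEBOOK[int(t[len(PREFIX):-1])] for t in tokens]


def extend_vocab(text_vocab):
    """Append one special token per codebook entry to a text vocabulary."""
    motion = [f"{PREFIX}{i}>" for i in range(len(CODEBOOK))]
    return text_vocab + [t for t in motion if t not in text_vocab]


if __name__ == "__main__":
    human_clip = [(0.9, 0.1, 0.0), (0.1, 0.8, 0.1)]  # hypothetical human features
    robot_clip = [(1.1, -0.1, 0.0)]                   # hypothetical humanoid features
    print(to_motion_tokens(human_clip))  # ['<motion_1>', '<motion_2>']
    print(to_motion_tokens(robot_clip))  # ['<motion_1>']
    print(extend_vocab(["<pad>", "walk"]))
```

Because human and humanoid frames are quantized against the same codebook in this sketch, a human clip and a humanoid clip showing similar motion end up as the same tokens. That shared token space is the property a "unified" vocabulary needs in order to let plentiful human motion data stand in for scarce humanoid data.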