Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Haozhe Jia; Honglei Jin; Yuan Zhang; Youcheng Fan; Shaofeng Liang; Lei Wang; Shuxu Jin; Kuimou Yu; Zinuo Zhang; Jianfei Song; Wenshuo Chen; Yutao Yue

arXiv:2605.14417·cs.RO·May 21, 2026

TL;DR

This paper introduces DAJI, a hierarchical framework enabling humanoid robots to anticipate future physical states from language instructions, improving control and responsiveness.

Contribution

DAJI is the first framework to explicitly encode anticipatory joint intent from language, integrating future-aware control with language-conditioned humanoid motion.

Findings

01 Achieves 94.42% success in HumanML3D-style generation

02 Attains 0.152 subsequence FID on the BABEL dataset

03 Demonstrates strong anticipatory control in streaming instructions

Abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments…
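The phrase "student-driven rollouts" can be illustrated with a toy sketch. This is not the authors' implementation: it swaps the diffusion action policy for a two-parameter linear student, uses a one-dimensional state, and every name (`teacher_action`, `LinearStudent`, `distill`) is hypothetical. It shows only the general pattern, assuming a DAgger-style setup: the student's own actions decide which states are visited, and a teacher with access to a future target labels those states.

```python
import random


def teacher_action(state, future_target):
    """Hypothetical future-aware teacher: it can see the upcoming target."""
    return 0.5 * (future_target - state)


class LinearStudent:
    """Toy stand-in for a deployable policy (assumption: linear, not diffusion)."""

    def __init__(self):
        self.w = 0.0
        self.b = 0.0

    def act(self, state):
        return self.w * state + self.b

    def fit_step(self, state, label, lr=0.05):
        # One SGD step on the squared error between student and teacher actions.
        err = self.act(state) - label
        self.w -= lr * err * state
        self.b -= lr * err


def distill(student, episodes=500, horizon=20, target=1.0, seed=0):
    rng = random.Random(seed)
    for _ in range(episodes):
        state = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            # The teacher labels the state the student actually reached.
            label = teacher_action(state, target)
            student.fit_step(state, label)
            # The rollout advances with the student's action, not the teacher's.
            state = state + student.act(state)
    return student


if __name__ == "__main__":
    trained = distill(LinearStudent())
    print(trained.w, trained.b)
```

Because the training states come from the student's own behavior, the student is supervised on the states it will encounter at deployment rather than only on states the teacher would visit.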
