Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control
Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

TL;DR
This paper introduces DAJI, a hierarchical framework enabling humanoid robots to anticipate future physical states from language instructions, improving control and responsiveness.
Contribution
DAJI is the first framework to explicitly encode anticipatory joint intent from language, integrating future-aware control with language-conditioned humanoid motion.
Findings
Achieves 94.42% success in HumanML3D-style generation
Attains 0.152 subsequence FID on the BABEL dataset
Demonstrates strong anticipatory control under streaming instructions
Abstract
Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments…
