Being-H0.7: A Latent World-Action Model from Egocentric Videos

Hao Luo; Wanpeng Zhang; Yicheng Feng; Sipeng Zheng; Haiweng Xu; Chaoyi Xu; Ziheng Xi; Yuhui Fu; Zongqing Lu

arXiv:2605.00078 · cs.RO · May 4, 2026


TL;DR

Being-H0.7 is a latent world-action model that adds future-aware reasoning to Visual-Language-Action (VLA) policies without predicting future frames in pixel space, which avoids the training and inference overhead of video rollouts.

Contribution

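The paper proposes a latent reasoning interface, a set of learnable latent queries placed between perception and action, trained with a future-informed dual-branch design that enables future-aware policy inference without visual rollouts.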

Findings

01

Achieves state-of-the-art or comparable results across six benchmarks.

02

Combines benefits of world models with direct VLA policy efficiency.

03

Discards the posterior branch at inference for deployability.

Abstract

Visual-Language-Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world-action models introduce future prediction through video rollouts, yet pixel-space prediction is a costly and indirect substrate for control, as it may model visual details irrelevant to action generation and introduces substantial training or inference overhead. We present Being-H0.7, a latent world-action model that brings future-aware reasoning into VLA-style policies without generating future frames. Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface, and trains them with a future-informed dual-branch design: a deployable prior…
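To make the latent-query idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract is truncated, so the way the posterior branch is conditioned (here, on hypothetical future tokens), the squared-error alignment term, the single-head attention without learned projections, and every name and size below are placeholders chosen for this example.

```python
import math
import random

random.seed(0)

# Hypothetical sizes: token width, observation tokens, future tokens,
# latent queries, action dimensions. None come from the paper.
D, N_OBS, N_FUT, N_Q, A = 8, 6, 6, 4, 7


def rand_matrix(rows, cols, scale=1.0):
    """Random matrix standing in for learned parameters or encoder outputs."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def cross_attend(queries, context):
    """Single-head scaled dot-product attention of queries over context tokens."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, c) * scale for c in context])
        out.append([sum(w * c[j] for w, c in zip(weights, context))
                    for j in range(len(q))])
    return out


# Latent queries sit between perception and action (learnable in a real model).
LATENT_QUERIES = rand_matrix(N_Q, D)
# Action head weights: one row per action dimension (assumed linear head).
W_ACTION = rand_matrix(A, N_Q * D, scale=0.1)


def prior_branch(obs_tokens):
    """Deployable branch: queries see only the current observation tokens."""
    return cross_attend(LATENT_QUERIES, obs_tokens)


def posterior_branch(obs_tokens, future_tokens):
    """Training-only branch: queries also see future tokens (assumption)."""
    return cross_attend(LATENT_QUERIES, obs_tokens + future_tokens)


def alignment_loss(prior_latents, posterior_latents):
    """Placeholder objective: mean squared error between the two latent sets."""
    diffs = [(p - q) ** 2
             for pr, po in zip(prior_latents, posterior_latents)
             for p, q in zip(pr, po)]
    return sum(diffs) / len(diffs)


def action_head(latents):
    """Map the flattened latents to an action vector."""
    flat = [x for row in latents for x in row]
    return [dot(flat, w_row) for w_row in W_ACTION]


obs = rand_matrix(N_OBS, D)   # stand-in for encoded observation and language
fut = rand_matrix(N_FUT, D)   # stand-in for encoded future frames

# Training: both branches run, and the prior is pulled toward the posterior.
z_prior = prior_branch(obs)
z_post = posterior_branch(obs, fut)
train_loss = alignment_loss(z_prior, z_post)

# Inference: the posterior branch is dropped; no future frames are generated.
action = action_head(z_prior)
```

The point of the sketch is the asymmetry: future information enters only through the training-time branch, so the deployed path costs one pass over the current observation and never produces pixels.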
